# Generate a GML file for network analysis

This notebook accompanies

> Hall, M (2017). Three data analytics party tricks. _The Leading Edge_ **36** (3).

I am exchanging data with `NetworkX` (the graph analysis package) in [**Graph Modelling Language**](https://en.wikipedia.org/wiki/Graph_Modelling_Language) (not to be confused with Graph Markup Language, which is a dialect of XML). 

This is the first time I've met this format so I am going to do this in a pretty naïve way. 

In [1]:
import pandas as pd
import unidecode
import re
from collections import defaultdict
from itertools import combinations

Read the file I made from the full bibliographic database. It is a list of lists, in which each inner list is a list of authors for a since paper. The list-of-lists is stored as a Python 'pickle' file. I'm reading it into Pandas, which may or may not turn out to be convenient later. 

In [2]:
ds = pd.read_pickle('data/authors.pkl')

In [3]:
ds.head()

7566    [M. N. Nabighian, V. J. S. Grauch, R. O. Hanse...
6095             [Yoshio Ueda, Ryuji Kubota, Jiro Segawa]
3289                      [Philip S. Schultz, August Lau]
4218                      [Norman S. Neidell, Neal Berry]
7167                       [Hengchang Dai, Xiang-Yang Li]
Name: authors, dtype: object

## Define some preprocessing

Dealing with names from a database is almost always trickjy. There are weird characters (most of which I should already have dealt with, outside of this notebook), and variations in spelling for a given person's name. Some of these variations are very hard to deal with, so we're not going to end up with perfect data. 

This is a bit hacky...

- To avoid confusing `L Lines` with `Larry Lines` and `LR Lines` and `Laurence R Lines`, etc, I am going to keep only initials plus surnames: `LLines`. 
- I also want to maintain bits of surnames like 'van de', so `MVanDerBaan` for `MirkoVanDerBaan`. 
- However, if a surname has 4 or fewer characters, I will keep the full name, so `MattHall` for `Matt Hall` (otherwise we get a lot of collisions on short, common names like `Li`). 

In [4]:
def process_name(name):
    name = unidecode.unidecode(name).title()
    name = re.sub(r"[-\.']", r'', name).strip()
    bits = name.split()
    for bit in bits:
        extra = ''
        if bit in ['Van', 'Von', 'Der', 'De', 'Da', 'Di', 'Du', 'Della', 'Dos', 'Al', 'El', 'La', 'Le', 'St']:
            extra += bit
    if len(bits[-1]) < 5:
        return bits[0] + extra + bits[-1]
    return name[0] + extra + bits[-1]

## Collect author data

We'll use this as we loop over the dataframe (yes, there's probably a much better way to do this!).

The GLM format has two parts: a node list, and an edge list.

Node list part:

    node [
      id 72
      label "Toussaint"
    ]
    node [
      id 73
      label "Child1"
    ]

Edge list part:

    edge [
      source 19
      target 17
      value 4
    ]
    edge [
      source 19
      target 18
      value 4
    ]

Let's make a dict first, like:

    { Brown: {Smith: 1, Jones: 3, Einstein: 23},
      Smith: {Brown: 1, Doe: 3},
    }
    
Undirected so we can just use combinations:

In [5]:
auth_nodes = []
auth_edges = defaultdict(lambda: defaultdict(int))

for auth_list in ds:
    
    # Preprocess
    auths = sorted(process_name(auth) for auth in auth_list)
    auth_nodes += auths

    # Build edge dict.
    combs = combinations(auths, 2)
    for (src, trgt) in combs:
        auth_edges[src][trgt] += 1

auth_nodes = list(set(auth_nodes))

In [6]:
auth_nodes

['CReeves',
 'SBiehler',
 'SamuelKarp',
 'JSchleicher',
 'RWatts',
 'JSchmoker',
 'DaleCox',
 'GSimmons',
 'CPower',
 'RGodfrey',
 'LKonstantaki',
 'DaniOr',
 'RHorne',
 'IBeresnev',
 'JWong',
 'HWashburn',
 'DBlair',
 'MGreenhalgh',
 'APoikonen',
 'GerhardLukk',
 'JJohnson',
 'AQuarteroni',
 'QingyunDiDi',
 'JKoski',
 'JBerryhill',
 'ABarthelmes',
 'JAllingham',
 'PGarossino',
 'DAnderson',
 'MJones',
 'GRice',
 'RLopes',
 'DMagnier',
 'WMcguinness',
 'JPatch',
 'ZengliDuDu',
 'CCamerlynck',
 'MZhang',
 'LBorgman',
 'DGough',
 'CVaughan',
 'DKinman',
 'SAltaner',
 'KLau',
 'MBuonora',
 'MHewitt',
 'RobertBaum',
 'SUpadhyay',
 'BSeymour',
 'ZhengpingLiu',
 'LBerryman',
 'AHusseiny',
 'EKelly',
 'YingjieGao',
 'ECaspari',
 'ABeck',
 'ArmandoSena',
 'EmilyFay',
 'JAllsop',
 'CanYang',
 'GYadav',
 'JCarvalho',
 'XiangDuDu',
 'GNover',
 'JohnBeck',
 'AndrejBona',
 'ZhanxiangHe',
 'WFilipo',
 'JorgenPihl',
 'OPedersen',
 'AKuvshinov',
 'SThiel',
 'JTrier',
 'DAdkinsHeljeson',
 'CDresbach',


In [7]:
auth_edges

defaultdict(<function __main__.<lambda>>,
            {'MKnoll': defaultdict(int,
                         {'PRouth': 3,
                          'RKnight': 1,
                          'TJohnson': 1,
                          'WBarrash': 5,
                          'WClement': 4}),
             'MPrasad': defaultdict(int,
                         {'MSaidian': 1,
                          'MZimmer': 2,
                          'QifeiNiu': 1,
                          'RMeissner': 1,
                          'SZargari': 2,
                          'TWilkinson': 1,
                          'WWoodruff': 2}),
             'CReeves': defaultdict(int, {'CTarlowski': 1, 'NPaterson': 1}),
             'SBiehler': defaultdict(int,
                         {'StephenPark': 1,
                          'TienChangLee': 1,
                          'TiengChangLee': 1,
                          'WBaldridge': 1,
                          'WStephenson': 1}),
             'GSzurek': defaultdict(in

## Unique IDs

Because of how GML works, we really need a unique ID for each author. So let's make a dictionary mapping each one to a unique integer:

In [8]:
aid = {k:i for i, k in enumerate(sorted(auth_nodes))}

In [9]:
aid

{'TScheuer': 8526,
 'MKnoll': 5785,
 'MPrasad': 5955,
 'CReeves': 1299,
 'DGoldberg': 1700,
 'GSzurek': 3003,
 'YiguangHu': 9439,
 'AKuckes': 325,
 'WBrisbin': 8845,
 'YErlangga': 9318,
 'SamuelKarp': 8178,
 'BSternberg': 928,
 'JiangLi': 4577,
 'WLancaster': 8953,
 'MGrech': 5682,
 'HWang': 3400,
 'KBube': 4767,
 'MDehghannejad': 5604,
 'NYadari': 6394,
 'WGillingham': 8890,
 'JSchleicher': 4327,
 'POkoye': 6790,
 'DRichter': 1896,
 'WSchneider': 9025,
 'PMcgowan': 6753,
 'XingzhouLiu': 9256,
 'RWatts': 7577,
 'JLea': 4091,
 'RobertBrod': 7645,
 'CWalker': 1398,
 'JSchmoker': 4328,
 'DaleCox': 2019,
 'GSimmons': 2992,
 'RBailey': 7029,
 'RaoulVajk': 7620,
 'WentaoMu': 9135,
 'CPower': 1289,
 'LianghuiGuo': 5408,
 'RGodfrey': 7197,
 'DDrahos': 1642,
 'KHambacker': 4815,
 'DaniOr': 2020,
 'JHaldorsen': 3967,
 'DSmit': 1946,
 'XiaoniuZeng': 9230,
 'IBeresnev': 3558,
 'DShillington': 1934,
 'GuangjieWang': 3100,
 'JWong': 4490,
 'MotoakiSato': 6222,
 'HWashburn': 3402,
 'DBlair': 1564,
 '

## Build GML file parts

Now we can iterate over the nodes to build up the node-list part of the GML file:

In [10]:
import itertools
import numpy as np

node_template = """  node [
    id {}
    label "{}"
  ]
"""

auth_node_text = ''

for auth in auth_nodes:
    auth_node_text += node_template.format(aid[auth], auth)
    
print(auth_node_text[:202])

  node [
    id 1299
    label "CReeves"
  ]
  node [
    id 7736
    label "SBiehler"
  ]
  node [
    id 8178
    label "SamuelKarp"
  ]
  node [
    id 4327
    label "JSchleicher"
  ]
  node [
    i


And a similar thing for the edge list part:

In [11]:
edge_template = """  edge [
    source {}
    target {}
    value {}
  ]
"""

auth_edge_text = ''

for auth, coauths in auth_edges.items():
    for coauth, value in coauths.items():
        auth_edge_text += edge_template.format(aid[auth], aid[coauth], value)

print(auth_edge_text[:226])

  edge [
    source 5785
    target 7302
    value 1
  ]
  edge [
    source 5785
    target 8861
    value 4
  ]
  edge [
    source 5785
    target 8421
    value 1
  ]
  edge [
    source 5785
    target 6821
    value 3
  


## Write out file

In [12]:
gml_template = """Creator "Matt Hall and SEG"
graph [
  comment "Coauthorship graph, SEG Geophysics"
  directed 0
  
{}{}
]
"""

auth_gml_text = gml_template.format(auth_node_text, auth_edge_text)

In [13]:
with open('data/coauthors.gml', 'w', encoding='ascii') as f:
    f.write(auth_gml_text)

Now we can go off and explore this dataset in [Coauthor_network_analysis.ipynb](Coauthor_network_analysis.ipynb).

<hr />

&copy; 2017 Agile Scientific, licensed under CC-BY (text) and Apache 2.0 (code)