In [2]:
import numpy as np
import json

from ahyper import utils, annotated_hypergraph

In [3]:
with open('data/enron_hypergraph_annotated.json') as file:
    data = json.load(file)

roles = ['cc', 'from', 'to']

In [4]:
data[0:2]

[{'date': '1998-11-13 12:07:00', 'from': [67], 'to': [108], 'cc': []},
 {'date': '1998-11-19 15:19:00', 'from': [67], 'to': [73], 'cc': []}]

# Construct an Annotated Hypergraph

In [5]:
A = annotated_hypergraph.AnnotatedHypergraph.from_records(data, roles)

First, `A` stores lists of the node and edge ids. Nodes are numbered from $0$ to $n-1$. Edges are numbered from $-1$ to $-m$. 

In [6]:
A.get_node_list()[0:10]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 10])

In [7]:
A.get_edge_list()[0:10]

array([-10504, -10503, -10502, -10501, -10500, -10499, -10498, -10497,
       -10496, -10495])

We can also get the node degree sequence, optionally broken down by role: 

In [8]:
# degree sequence
A.node_degrees() # get all node degrees (totals)
A.node_degrees(by_role = True)[4] # get the role-degrees of node 4 

{'cc': 0, 'from': 12, 'to': 5}

Similarly, we can get the edge dimension sequence, again optionally broken down by role: 

In [9]:
# edge dimension sequence

A.edge_dimensions() # get all edge dimensions (totals)
A.edge_dimensions(by_role = True)[-5] # get the role-dimensions of edge -5

{'cc': 2, 'from': 1, 'to': 0}
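To make the by-role counts concrete, here is a minimal sketch that recomputes them directly from records shaped like `data` above. The function names `role_degrees_from_records` and `role_dimensions_from_records` are hypothetical and are not part of `ahyper`; the library may compute these quantities differently.

```python
from collections import Counter, defaultdict

def role_degrees_from_records(records, roles):
    """Hypothetical helper: count, for each node, how often it appears in each role."""
    degrees = defaultdict(Counter)
    for record in records:
        for role in roles:
            for nid in record.get(role, []):
                degrees[nid][role] += 1
    # Report every role for every node, including roles with count zero.
    return {nid: {r: c[r] for r in roles} for nid, c in degrees.items()}

def role_dimensions_from_records(records, roles):
    """Hypothetical helper: count, for each edge (record), how many nodes hold each role."""
    return [{r: len(record.get(r, [])) for r in roles} for record in records]

# The two records printed at the top of this notebook.
example_records = [
    {'date': '1998-11-13 12:07:00', 'from': [67], 'to': [108], 'cc': []},
    {'date': '1998-11-19 15:19:00', 'from': [67], 'to': [73], 'cc': []},
]
example_roles = ['cc', 'from', 'to']

example_degrees = role_degrees_from_records(example_records, example_roles)
example_dimensions = role_dimensions_from_records(example_records, example_roles)
```

Summing the values of each inner dict gives the total degree of a node or the total dimension of an edge.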

Internally, `A` represents the data as an annotated node-edge incidence list. It is convenient to think of this as the edge list of a bipartite graph, with nodes on one side and hyperedges on the other, in which each edge is labeled with a name and a role. The list can be accessed directly:

In [10]:
A.get_IL()[0:10]

[NodeEdgeIncidence(nid=67, role='from', eid=-1, meta={'date': '1998-11-13 12:07:00'}),
 NodeEdgeIncidence(nid=108, role='to', eid=-1, meta={'date': '1998-11-13 12:07:00'}),
 NodeEdgeIncidence(nid=67, role='from', eid=-2, meta={'date': '1998-11-19 15:19:00'}),
 NodeEdgeIncidence(nid=73, role='to', eid=-2, meta={'date': '1998-11-19 15:19:00'}),
 NodeEdgeIncidence(nid=73, role='cc', eid=-3, meta={'date': '1998-11-19 16:24:00'}),
 NodeEdgeIncidence(nid=67, role='from', eid=-3, meta={'date': '1998-11-19 16:24:00'}),
 NodeEdgeIncidence(nid=108, role='cc', eid=-4, meta={'date': '1998-11-24 10:23:00'}),
 NodeEdgeIncidence(nid=96, role='cc', eid=-4, meta={'date': '1998-11-24 10:23:00'}),
 NodeEdgeIncidence(nid=22, role='cc', eid=-4, meta={'date': '1998-11-24 10:23:00'}),
 NodeEdgeIncidence(nid=67, role='from', eid=-4, meta={'date': '1998-11-24 10:23:00'})]
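The relationship between records and the incidence list can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the `Incidence` tuple and `records_to_incidences` are hypothetical names, edge ids are assumed to be assigned as $-1, -2, \dots$ in record order, and every non-role field of a record is assumed to be carried along as metadata.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's NodeEdgeIncidence.
Incidence = namedtuple("Incidence", ["nid", "role", "eid", "meta"])

def records_to_incidences(records, roles):
    """Flatten records into one (node, role, edge, metadata) tuple per node-edge pair."""
    incidences = []
    for k, record in enumerate(records, start=1):
        # Assumption: anything that is not a role field is edge metadata.
        meta = {key: val for key, val in record.items() if key not in roles}
        for role in roles:
            for nid in record.get(role, []):
                incidences.append(Incidence(nid, role, -k, meta))
    return incidences

example_records = [
    {'date': '1998-11-13 12:07:00', 'from': [67], 'to': [108], 'cc': []},
    {'date': '1998-11-19 15:19:00', 'from': [67], 'to': [73], 'cc': []},
]
example_incidences = records_to_incidences(example_records, ['cc', 'from', 'to'])
```

Each hyperedge of dimension $k$ contributes $k$ entries, so the length of the list equals the sum of the edge dimensions.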

# Stub-Labeled MCMC 

We can define a simple version of stub-labeled Markov Chain Monte Carlo in this space. Each step essentially amounts to swapping two edges of the bipartite graph, and a swap is allowed only if the roles of the two edges agree. This MCMC algorithm preserves the degree sequence and the edge dimension sequence, including the `by_role` variants.
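The role-matched swap can be sketched on plain `(nid, role, eid)` tuples. This is a minimal sketch and not the implementation behind `stub_labeled_MCMC`: the function names are hypothetical, pairs are assumed to be proposed uniformly at random, and no check is made for swaps that would place the same node twice in one edge, which the library may or may not reject.

```python
import random
from collections import Counter

def stub_swap_sketch(incidences, n_steps, seed=None):
    """Repeatedly pick two incidences and swap their node ids if their roles agree."""
    rng = random.Random(seed)
    il = [list(inc) for inc in incidences]  # mutable copies of [nid, role, eid]
    for _ in range(n_steps):
        a, b = rng.sample(range(len(il)), 2)
        if il[a][1] != il[b][1]:
            continue  # roles differ: reject the proposal
        il[a][0], il[b][0] = il[b][0], il[a][0]
    return [tuple(inc) for inc in il]

def role_degrees(incidences):
    """Number of incidences per (node, role) pair."""
    return Counter((nid, role) for nid, role, _ in incidences)

def role_dimensions(incidences):
    """Number of incidences per (edge, role) pair."""
    return Counter((eid, role) for _, role, eid in incidences)

example_il = [
    (67, 'from', -1), (108, 'to', -1),
    (67, 'from', -2), (73, 'to', -2),
    (73, 'cc', -3), (67, 'from', -3),
]
shuffled_il = stub_swap_sketch(example_il, n_steps=1000, seed=0)
```

Because a swap only exchanges node ids between two incidences with the same role, each node keeps the same number of incidences per role and each edge keeps the same number of slots per role, which is why both by-role sequences are preserved.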

# Check for preservation of node degrees and edge dimensions

In [11]:
d0 = A.node_degrees(by_role = True)
k0 = A.edge_dimensions(by_role = True)

In [12]:
A.stub_labeled_MCMC(n_steps = 100000)
A.get_IL()[0:10] # not the same list as above

[NodeEdgeIncidence(nid=73, role='from', eid=-1, meta={'date': '1998-11-13 12:07:00'}),
 NodeEdgeIncidence(nid=86, role='to', eid=-1, meta={'date': '1998-11-13 12:07:00'}),
 NodeEdgeIncidence(nid=96, role='from', eid=-2, meta={'date': '1998-11-19 15:19:00'}),
 NodeEdgeIncidence(nid=87, role='to', eid=-2, meta={'date': '1998-11-19 15:19:00'}),
 NodeEdgeIncidence(nid=41, role='cc', eid=-3, meta={'date': '1998-11-19 16:24:00'}),
 NodeEdgeIncidence(nid=108, role='from', eid=-3, meta={'date': '1998-11-19 16:24:00'}),
 NodeEdgeIncidence(nid=60, role='cc', eid=-4, meta={'date': '1998-11-24 10:23:00'}),
 NodeEdgeIncidence(nid=101, role='cc', eid=-4, meta={'date': '1998-11-24 10:23:00'}),
 NodeEdgeIncidence(nid=87, role='cc', eid=-4, meta={'date': '1998-11-24 10:23:00'}),
 NodeEdgeIncidence(nid=36, role='from', eid=-4, meta={'date': '1998-11-24 10:23:00'})]

In [13]:
d = A.node_degrees(by_role = True)
k = A.edge_dimensions(by_role = True)

In [14]:
d0 == d, k0 == k # but the degree and dimension sequences are preserved

(True, True)

# Output

You can read data out of `A` either as a list of dicts ("records", suitable for output as JSON) or as an incidence list.

In [15]:
A.get_records()[0:2]

[{'cc': [],
  'from': [73],
  'to': [86],
  'eid': -1,
  'date': '1998-11-13 12:07:00'},
 {'cc': [],
  'from': [96],
  'to': [87],
  'eid': -2,
  'date': '1998-11-19 15:19:00'}]

In [16]:
A.get_IL()[0:4]

[NodeEdgeIncidence(nid=73, role='from', eid=-1, meta={'date': '1998-11-13 12:07:00'}),
 NodeEdgeIncidence(nid=86, role='to', eid=-1, meta={'date': '1998-11-13 12:07:00'}),
 NodeEdgeIncidence(nid=96, role='from', eid=-2, meta={'date': '1998-11-19 15:19:00'}),
 NodeEdgeIncidence(nid=87, role='to', eid=-2, meta={'date': '1998-11-19 15:19:00'})]
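The two output formats carry the same information. As a hedged illustration, the sketch below groups `(nid, role, eid, meta)` tuples back into records; `incidences_to_records` is a hypothetical helper, not the code behind `get_records`, and it assumes that all incidences of an edge share the same metadata.

```python
def incidences_to_records(incidences, roles):
    """Group (nid, role, eid, meta) tuples by edge id into one dict per edge."""
    records = {}
    for nid, role, eid, meta in incidences:
        if eid not in records:
            # Start each record with an empty list per role, then eid and metadata.
            records[eid] = {**{r: [] for r in roles}, 'eid': eid, **meta}
        records[eid][role].append(nid)
    # Edge ids are negative, so descending order gives -1, -2, ...
    return [records[eid] for eid in sorted(records, reverse=True)]

example_il = [
    (73, 'from', -1, {'date': '1998-11-13 12:07:00'}),
    (86, 'to', -1, {'date': '1998-11-13 12:07:00'}),
    (96, 'from', -2, {'date': '1998-11-19 15:19:00'}),
    (87, 'to', -2, {'date': '1998-11-19 15:19:00'}),
]
example_out = incidences_to_records(example_il, ['cc', 'from', 'to'])
```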

# Observables

For an annotated hypergraph null model to make sense, we need observables on the hypergraph that take the node roles into account.

Check: clustering coefficients will also change under a shuffle, but differently from the non-annotated case.
The clustering coefficients are only calculated on the simplified graph projection.

One observable is the local role density around a node.
For a focal node $i$, this is the count of each role, summed over all neighbours of $i$ and over all edges. The focal node itself can be included or excluded.
We expect this to vary strongly across a graph.
For example, in an authorship graph, an author who is first author on a large number of papers is more likely to be surrounded by authors who are middle or last authors.
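One possible reading of this definition can be sketched on `(nid, role, eid)` tuples: for each edge containing the focal node, tally the roles of the other incidences in that edge, then normalise the tallies to sum to one. The function `role_density_sketch` is hypothetical, and `ahyper.observables.local_role_density` may define neighbourhoods or normalisation differently.

```python
from collections import Counter, defaultdict

def role_density_sketch(incidences, include_focus=False):
    """Fraction of each role among the co-members of a node, over all its edges."""
    by_edge = defaultdict(list)
    for nid, role, eid in incidences:
        by_edge[eid].append((nid, role))

    counts = defaultdict(Counter)
    for members in by_edge.values():
        for idx, (focus, _) in enumerate(members):
            for jdx, (_, role) in enumerate(members):
                if jdx == idx and not include_focus:
                    continue  # skip the focal node's own incidence
                counts[focus][role] += 1

    # Normalise each node's tallies; nodes with no counted neighbours are omitted.
    return {
        nid: {role: n / sum(c.values()) for role, n in c.items()}
        for nid, c in counts.items()
    }

example_il = [
    (67, 'from', -1), (108, 'to', -1),
    (67, 'from', -2), (73, 'to', -2),
    (73, 'cc', -3), (67, 'from', -3),
]
example_density = role_density_sketch(example_il)
```

In this toy example node 67 only ever sends, so its neighbourhood consists of `to` and `cc` roles, while the neighbourhoods of nodes 73 and 108 consist entirely of `from`.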

In [17]:
from ahyper.observables import local_role_density

In [19]:
A = annotated_hypergraph.AnnotatedHypergraph.from_records(data, roles)

densities = local_role_density(A, include_focus=False)
densities

{73: Counter({'from': 0.17114914425427874,
          'cc': 0.15403422982885084,
          'to': 0.6748166259168704}),
 108: Counter({'cc': 0.13152866242038216,
          'from': 0.13312101910828025,
          'to': 0.7353503184713376}),
 96: Counter({'cc': 0.11622708985248101,
          'from': 0.24541797049620026,
          'to': 0.6383549396513187}),
 22: Counter({'cc': 0.08975175047740293,
          'from': 0.2138765117759389,
          'to': 0.6963717377466582}),
 66: Counter({'cc': 0.2321291314373559,
          'from': 0.2275172943889316,
          'to': 0.5403535741737125}),
 39: Counter({'cc': 0.08820882088208822,
          'from': 0.355985598559856,
          'to': 0.5558055805580558}),
 15: Counter({'cc': 0.17428087986463622,
          'from': 0.2182741116751269,
          'to': 0.6074450084602369}),
 100: Counter({'cc': 0.06228610540725531,
          'from': 0.31759069130732376,
          'to': 0.6201232032854209}),
 64: Counter({'from': 0.41279887482419125,
          'cc': 0

If we want a single value per node, we can calculate the normalised entropy of the role density.
A value of zero indicates that only one role is present in the neighbourhood, and a value of one indicates that all roles are equally prevalent.
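In symbols, writing $p_{i,r}$ for the density of role $r$ around node $i$ and $R$ for the number of roles (notation introduced here for illustration), the normalised entropy is

$$
H_i = -\frac{1}{\log_2 R} \sum_{r=1}^{R} p_{i,r} \log_2 p_{i,r},
$$

with the convention that terms with $p_{i,r} = 0$ contribute zero. For example, with $R = 3$ and densities $(1/2, 1/2, 0)$, the unnormalised entropy is $1$ bit, so $H_i = 1 / \log_2 3 \approx 0.63$.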

In [23]:
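def entropy(arr):
    """Shannon entropy of the distribution `arr`, normalised by log2(len(arr))."""
    # Zero entries are skipped, following the convention 0 * log(0) = 0.
    return -sum(x * np.log2(x) / np.log2(len(arr)) for x in arr if x > 0)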

entropies = {key:entropy(list(v.values())) for key, v in densities.items()}
entropies

{73: 0.7788592250999151,
 108: 0.6929658141041598,
 96: 0.8023202419191466,
 22: 0.7265859533990632,
 66: 0.9179455860929882,
 39: 0.826775179307021,
 15: 0.8551745939624713,
 100: 0.75868357235704,
 64: 0.7139791628071661,
 106: -0.0,
 86: 0.7488125692975123,
 67: 0.8629507941737385,
 8: 0.8890224608713345,
 7: 0.8762198233246863,
 61: 0.9417346261113357,
 55: 0.8103218117102273,
 101: 0.8547172088523843,
 38: 0.936535739891787,
 13: 0.9348829324907597,
 98: 0.8342022886441014,
 28: 0.9432274696214136,
 94: 0.8107876037764034,
 112: 0.9290352348139592,
 29: 0.6126016192893442,
 68: 0.35391313447815115,
 114: 0.9408744932171064,
 97: 0.8508180238264438,
 49: 0.7426049218212603,
 105: 0.7993101864710466,
 12: 0.818200410042982,
 35: 0.9354802540667478,
 102: 0.8637337992885298,
 19: 0.9086405396654728,
 57: 0.949332774520772,
 87: 0.8499600539234948,
 17: 0.9288666244136675,
 80: 0.9688548986965739,
 84: 0.9011417031105792,
 63: 0.8939447564374572,
 33: 0.9416191048131097,
 45: 0.951166

In [None]:
# Perform a stub shuffle
A.stub_labeled_MCMC(n_steps = 100000)

In [None]:
densities = local_role_density(A, include_focus=False)
densities

In [None]:
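# `densities` is a dict of Counters (not a DataFrame), so average the entropies directly
np.mean([entropy(list(v.values())) for v in densities.values()])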

# Next?

Possible next steps for this software include refactoring the internals to use pandas and implementing alternative MCMC schemes, possibly including vertex-labeled ones.