# Data Cleaning, Degenerate Edges and MCMC

An edge is *degenerate* if it contains the same node multiple times, possibly in different roles. In the Enron email data set, for example, if a person cc'd themselves on an email, then they will appear in the corresponding edge twice. Degenerate edges generalize self-loops in dyadic networks, and can cause technical issues for MCMC algorithms.
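To make the definition concrete, here is a minimal sketch that assumes a hypothetical plain-Python representation in which an edge is a list of `(node, role)` pairs. This is an illustration only, not the data structure used by `ahyper`, and `is_degenerate` is a made-up helper name.

```python
# Hypothetical representation: an edge is a list of (node, role) incidences.
# This is NOT the ahyper data structure, only an illustration.

def is_degenerate(edge):
    """Return True if any node appears more than once in the edge."""
    nodes = [node for node, _role in edge]
    return len(set(nodes)) < len(nodes)

# Person 7 sends an email to person 3 and cc's themselves.
self_cc = [(7, 'from'), (3, 'to'), (7, 'cc')]
# Same email, but the cc goes to a different person.
ordinary = [(7, 'from'), (3, 'to'), (5, 'cc')]

print(is_degenerate(self_cc))   # True
print(is_degenerate(ordinary))  # False
```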

In [1]:
import numpy as np
import json
from itertools import groupby
from collections import namedtuple

from ahyper import utils, annotated_hypergraph

In [2]:
with open('data/enron_hypergraph_annotated.json') as file:
    data = json.load(file)

roles = ['cc', 'from', 'to']

In [3]:
A = annotated_hypergraph.AnnotatedHypergraph(data, roles)

In [4]:
A.n, A.m

(112, 10504)

We can count the number of degenerate edges in `A` using the `count_degeneracies()` method of `AnnotatedHypergraph`. An edge counts as degenerate if at least one node is repeated at least once; multiple repetitions don't make the edge "more degenerate."

In [5]:
print('There are ' + str(A.count_degeneracies()) + ' degenerate edges.')

There are 1191 degenerate edges.
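The counting rule can be sketched on toy data. The block below again assumes a hypothetical list-of-`(node, role)` representation and a made-up function name, and is not the `ahyper` implementation. It shows that an edge with several repeated nodes still contributes exactly one to the count.

```python
from collections import Counter

def count_degenerate_edges(edges):
    """Count edges in which at least one node occurs more than once."""
    count = 0
    for edge in edges:
        occurrences = Counter(node for node, _role in edge)
        if any(k > 1 for k in occurrences.values()):
            count += 1  # counted once, however many repeats the edge has
    return count

edges = [
    [(1, 'from'), (2, 'to')],                                   # not degenerate
    [(1, 'from'), (1, 'cc'), (2, 'to')],                        # one repeated node
    [(3, 'from'), (3, 'to'), (3, 'cc'), (4, 'to'), (4, 'cc')],  # several repeats
]

print(count_degenerate_edges(edges))  # 2
```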


The following method cleans degenerate edges in `A` by removing the offending node-edge incidences. The `precedence` keyword specifies which roles take priority when a node appears more than once in an edge. In the example below, `from` takes precedence over `cc`, so if a node appears in an edge with both a `from` role and a `cc` role, the `cc` incidence will be deleted.

In [6]:
A.remove_degeneracies(precedence = {'from' : 1, 'to' : 2, 'cc' : 3})
print('There are ' + str(A.count_degeneracies()) + ' degenerate edges.')

Removed 1246 node-edge incidences
There are 0 degenerate edges.
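One way such a precedence rule could work is sketched below, under the assumption that for each repeated node only the incidence with the lowest precedence number is kept. The function and the list-of-`(node, role)` representation are hypothetical and are not taken from `ahyper`.

```python
def remove_degeneracies(edge, precedence):
    """Keep one incidence per node: the one whose role ranks first.

    Returns the cleaned edge and the number of incidences removed.
    """
    best = {}  # node -> role kept so far
    for node, role in edge:
        if node not in best or precedence[role] < precedence[best[node]]:
            best[node] = role
    cleaned = list(best.items())
    return cleaned, len(edge) - len(cleaned)

precedence = {'from': 1, 'to': 2, 'cc': 3}

# Node 7 appears both as 'cc' and as 'from'; 'from' has higher priority.
edge = [(7, 'cc'), (3, 'to'), (7, 'from')]
cleaned, n_removed = remove_degeneracies(edge, precedence)

print(cleaned)    # [(7, 'from'), (3, 'to')]
print(n_removed)  # 1
```

Under this rule an edge in which a node appears three times loses two incidences, so the number of incidences removed can exceed the number of degenerate edges, as in the output above (1246 incidences for 1191 edges).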


## Singletons

Edges with only one node can cause interpretation issues. They can be correctly present in the data (e.g. single-author papers), incorrectly present in the data, or an artifact of other cleaning operations. For example, an edge in the original data that consists of a single person sending themselves a message contains the same node twice, once as `from` and once as `to`. After removing degeneracies, this edge will consist of a single `from` node. The following method can be used to remove such artifacts.

In [7]:
A.remove_singletons()

Removed 901 singletons.
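As a rough illustration of singleton removal, the sketch below drops every edge with exactly one distinct node from a toy edge list. The representation and the function are hypothetical stand-ins, not the `ahyper` code.

```python
def remove_singletons(edges):
    """Drop edges that contain exactly one distinct node.

    Returns the surviving edges and the number of edges removed.
    """
    kept = [edge for edge in edges
            if len({node for node, _role in edge}) != 1]
    return kept, len(edges) - len(kept)

edges = [
    [(7, 'from')],             # self-email after degeneracy removal
    [(1, 'from'), (2, 'to')],  # ordinary edge
    [(4, 'from')],             # another singleton
]

kept, n_removed = remove_singletons(edges)
nodes_left = {node for edge in kept for node, _role in edge}

print(n_removed)           # 2
print(sorted(nodes_left))  # [1, 2]
```

In this toy example nodes 7 and 4 appear only in singleton edges, so they drop out of the data along with those edges. The same effect would be consistent with a node count that shrinks after singleton removal.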


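The new dimensions of the data are shown below. Note that the order here is edges (`m`) first, then nodes (`n`), the reverse of the earlier cell.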

In [8]:
A.m, A.n

(9603, 110)

# Degeneracy-Avoiding MCMC

The first stub-labeled MCMC algorithm that was implemented, `stub_labeled_MCMC()`, can create degeneracies:

In [9]:
print('There are ' + str(A.count_degeneracies()) + ' degenerate edges.')

There are 0 degenerate edges.


In [10]:
A.stub_labeled_MCMC(n_steps=100000)
print('There are ' + str(A.count_degeneracies()) + ' degenerate edges.')

There are 466 degenerate edges.


The code below checks edges before executing a swap in order to avoid creating degeneracies. This comes at a performance cost: proposed swaps can be rejected, and I believe each step is also somewhat more expensive now.

In [None]:
A.remove_degeneracies(precedence = {'from' : 1, 'to' : 2, 'cc' : 3})
A.MCMC(n_steps = 100000)
print('There are ' + str(A.count_degeneracies()) + ' degenerate edges.')

Removed 554 node-edge incidences
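The idea of checking before swapping can be sketched as a simple rejection step. The block below is one plausible version under stated assumptions, not the `ahyper` implementation: it assumes a proposal picks one incidence from each of two distinct edges, requires matching roles, and rejects the swap if either edge would end up with a repeated node. All names (`try_swap`, `run_swaps`, `max_attempts`) are hypothetical, and no claim is made that this sketch samples from the intended distribution.

```python
import random

def would_be_degenerate(edge, position, new_node):
    """True if placing new_node at `position` would repeat a node in `edge`."""
    return any(node == new_node
               for k, (node, _role) in enumerate(edge) if k != position)

def try_swap(edges, rng):
    """Propose one swap between two edges; return True if it was accepted."""
    i, j = rng.sample(range(len(edges)), 2)  # two distinct edges
    a = rng.randrange(len(edges[i]))         # one incidence in each edge
    b = rng.randrange(len(edges[j]))
    node_a, role_a = edges[i][a]
    node_b, role_b = edges[j][b]
    if role_a != role_b:
        return False  # reject: only same-role incidences are swapped
    if (would_be_degenerate(edges[i], a, node_b)
            or would_be_degenerate(edges[j], b, node_a)):
        return False  # reject: the swap would create a degenerate edge
    edges[i][a] = (node_b, role_a)
    edges[j][b] = (node_a, role_b)
    return True

def run_swaps(edges, n_steps, seed=0, max_attempts=100_000):
    """Run until n_steps swaps are accepted (or max_attempts proposals made)."""
    rng = random.Random(seed)
    accepted = rejected = 0
    while accepted < n_steps and accepted + rejected < max_attempts:
        if try_swap(edges, rng):
            accepted += 1
        else:
            rejected += 1
    return accepted, rejected

edges = [
    [(1, 'from'), (2, 'to'), (3, 'cc')],
    [(4, 'from'), (5, 'to')],
    [(2, 'from'), (1, 'to'), (6, 'cc')],
    [(5, 'from'), (3, 'to')],
]

accepted, rejected = run_swaps(edges, n_steps=50)
print(f'{accepted} steps taken, {rejected} steps rejected.')
```

Every proposal pays for the membership checks in `would_be_degenerate`, and rejected proposals do no useful work, which illustrates the two sources of overhead mentioned above.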


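## Quick Check

As a sanity check, we record the edge dimensions and node degrees (both broken out by role) before running the degeneracy-avoiding MCMC and confirm that they are unchanged afterwards.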

In [12]:
E0 = A.edge_dimensions(by_role=True).copy()
D0 = A.node_degrees(by_role=True).copy()

A.MCMC(n_steps = 100000)

A.edge_dimensions(by_role = True) == E0, A.node_degrees(by_role = True) == D0 

100000 steps taken, 154420 steps rejected.


(True, True)
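To see why a role-preserving swap leaves both quantities unchanged, the toy sketch below computes them by hand. The two functions are hypothetical stand-ins named after the `ahyper` methods, operating on an assumed list-of-`(node, role)` representation. Their return types are chosen for illustration and may differ from what `ahyper` returns.

```python
from collections import Counter

def edge_dimensions_by_role(edges):
    """For each edge, count how many incidences carry each role."""
    return [Counter(role for _node, role in edge) for edge in edges]

def node_degrees_by_role(edges):
    """For each node, count how many incidences it has in each role."""
    degrees = {}
    for edge in edges:
        for node, role in edge:
            degrees.setdefault(node, Counter())[role] += 1
    return degrees

edges = [[(1, 'from'), (2, 'to')], [(3, 'from'), (4, 'to')]]
E0 = edge_dimensions_by_role(edges)
D0 = node_degrees_by_role(edges)

# Swap the 'to' nodes of the two edges: roles stay put, nodes move.
swapped = [[(1, 'from'), (4, 'to')], [(3, 'from'), (2, 'to')]]

print(edge_dimensions_by_role(swapped) == E0)  # True
print(node_degrees_by_role(swapped) == D0)     # True
```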