Kent Shefchek edited this page Mar 14, 2017 · 14 revisions

Post processors are meant to modify the graph right after the standard graph load occurs.

Clique merge

Motivation

Two ontologies can use different IDs to describe the same entity. Such nodes are typically connected by an axiom such as equivalentClass or sameAs, and a set of nodes connected this way is called a clique. One of the ontologies may provide different or additional information that still holds for the equivalent nodes. Instead of traversing every node in a clique at query time to gather the information of interest, that information can be consolidated ahead of time.

Solution

Within a clique, a leader is chosen to represent the entire clique. The leader is tagged with the Neo4j label cliqueLeader. All edges from the non-leaders are moved to the leader; only the equivalence edge to the leader remains. For traceability, each moved edge gets additional properties containing its original source and target node IDs.
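The edge move described above can be sketched in plain Python. This is a minimal illustration, not SciGraph's implementation: the `Edge` class, the `merge_clique` function, and the property names `original_source` and `original_target` are all hypothetical, and the sketch assumes that equivalence edges between two non-leaders are dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)


def merge_clique(edges, members, leader, equivalence_types):
    """Return a new edge list in which the leader carries the clique's edges."""
    members = set(members)
    merged = []
    for e in edges:
        inside = e.source in members and e.target in members
        if e.rel_type in equivalence_types and inside:
            # Keep only the equivalence edges that touch the leader
            # (assumption: the others are dropped).
            if leader in (e.source, e.target):
                merged.append(e)
            continue
        # Redirect any endpoint that is a clique member to the leader.
        src = leader if e.source in members else e.source
        tgt = leader if e.target in members else e.target
        props = dict(e.properties)
        if (src, tgt) != (e.source, e.target):
            # Traceability: record where the edge pointed before the move.
            # These property names are hypothetical.
            props["original_source"] = e.source
            props["original_target"] = e.target
        merged.append(Edge(src, tgt, e.rel_type, props))
    return merged
```

Edges that already start or end at the leader are copied unchanged, so only edges that were actually moved carry the traceability properties.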

The leader is chosen using three strategies, tried in order:

  1. Annotation
  • A node annotated in the ontology with a specific leader hint is chosen first. If two nodes within the same clique have this annotation, selection falls back on strategy 2.
  2. Prefix prioritization
  • Provide a list of prefixes to prioritize, in order. If this fails to select a leader, selection falls back on strategy 3.
  3. Alphabetical order
  • The first node in alphabetical order is selected.
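The three fallbacks above can be sketched as a single function. This is an illustrative sketch under stated assumptions rather than SciGraph's actual code: `choose_leader` and its parameters are hypothetical names, nodes are represented by their IRIs only, and the handling of `leaderForbiddenLabels` is left out. When several nodes share the winning prefix, the sketch assumes the tie is broken alphabetically.

```python
from typing import Iterable, Sequence


def choose_leader(
    clique: Iterable[str],
    annotated: frozenset = frozenset(),
    prefix_priority: Sequence[str] = (),
) -> str:
    """Pick a clique leader: annotation, then prefix priority, then alphabetical."""
    nodes = sorted(set(clique))  # sorted once so every tie-break is deterministic
    if not nodes:
        raise ValueError("clique is empty")

    # 1. Annotation: decisive only when exactly one node carries the hint.
    hinted = [n for n in nodes if n in annotated]
    if len(hinted) == 1:
        return hinted[0]

    # 2. Prefix prioritization: the first listed prefix matching any node wins.
    for prefix in prefix_priority:
        matches = [n for n in nodes if n.startswith(prefix)]
        if matches:
            return matches[0]  # assumption: alphabetical tie-break within a prefix

    # 3. Alphabetical order.
    return nodes[0]
```

For example, with a clique of made-up IRIs `http://purl.obolibrary.org/obo/DOID_1` and `http://purl.obolibrary.org/obo/OMIM_2`, a priority list that places the OMIM prefix before the DOID prefix selects the OMIM node, while an empty priority list selects the DOID node because it sorts first.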

Usage

The clique post-processor is enabled and configured in the YAML load file:

cliqueConfiguration:
  relationships:
    - [list of relationships that define a clique]
    - ...
  leaderAnnotation: [Annotation in the ontology that designates a leader]
  leaderPriority:
    - [list of prefixes to prioritize in order]
    - ...
  leaderForbiddenLabels:
    - [list of Neo4j labels which cannot be leaders]
    - ...
  batchCommitSize: [Optional int to define commits after x nodes have been processed. Influences the memory usage.]

For example:

cliqueConfiguration:
  relationships:
    - sameAs
    - equivalentClass
  leaderAnnotation: https://monarchinitiative.org/MONARCH_cliqueLeader
  leaderPriority:
    - http://www.ncbi.nlm.nih.gov/gene/
    - http://www.ncbi.nlm.nih.gov/pubmed/
    - http://purl.obolibrary.org/obo/NCBITaxon_
    - http://identifiers.org/ensembl/
    - http://purl.obolibrary.org/obo/OMIM_
    - http://purl.obolibrary.org/obo/DOID_
    - http://www.orpha.net/ORDO/Orphanet_
    - http://purl.obolibrary.org/obo/HP_
    - http://purl.obolibrary.org/obo/MP_
    - http://purl.obolibrary.org/obo/ZP_
  leaderForbiddenLabels:
    - anonymous
  batchCommitSize: 100000

Edge labeler

Motivation

In SciGraph, edges are typed by their IRIs. For graph exploration or debugging, it is convenient to add human-readable labels to the edges.

Solution

Since edge types are IRIs in most cases, we can look up the node with that IRI, fetch its label property, and add it to the edge as a property. If no node is found (for example, for an OWL relationship) or the node has no label, the edge still gets a lbl property whose value is the edge type.

Note: The property lbl is used instead of label because label is a reserved edge property in TinkerGraphs. TinkerGraphs are widely used in SciGraph for in-memory data modeling, so instead of handcrafting a wrapper around that restriction, it was easier to use an unconventional edge property name, especially since the property is provided mainly for convenience.

For example, before running the edge labeler:

:isDefinedBy[14288]{}
:http://purl.obolibrary.org/obo/RO_0002525[94272775]{iri:"http://purl.obolibrary.org/obo/RO_0002525",convenience:true,owlType:"subClassOf"}

After:

:isDefinedBy[14288]{lbl:"isDefinedBy"}
:http://purl.obolibrary.org/obo/RO_0002525[94272775]{lbl:"is subsequence of",iri:"http://purl.obolibrary.org/obo/RO_0002525",convenience:true,owlType:"subClassOf"}
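The lookup-with-fallback behaviour shown in the before/after example can be sketched as follows. This is a simplified model, not the SciGraph code: edges are plain dictionaries, `node_labels` is a hypothetical mapping from a node's IRI to its label property, and the function names are made up for illustration.

```python
def edge_label(edge_type, node_labels):
    """Return the value to store in the edge's lbl property."""
    label = node_labels.get(edge_type)
    # Fall back to the type itself when no node, or no label, exists.
    return label if label else edge_type


def label_edges(edges, node_labels):
    """Add an lbl property to every edge, in place, and return the edges."""
    for edge in edges:
        edge["properties"]["lbl"] = edge_label(edge["type"], node_labels)
    return edges


node_labels = {"http://purl.obolibrary.org/obo/RO_0002525": "is subsequence of"}
edges = [
    {"type": "isDefinedBy", "properties": {}},
    {"type": "http://purl.obolibrary.org/obo/RO_0002525",
     "properties": {"convenience": True, "owlType": "subClassOf"}},
]
label_edges(edges, node_labels)
```

After the call, the first edge carries its own type as lbl and the second carries the label of the matching node, while the existing properties are left untouched.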

Usage

Add this line in the yaml configuration file to enable the Edge labeler:

addEdgeLabel: true

All nodes labeler

Motivation

A Neo4j schema index is defined on a label and a property, so a node must carry a label before it can be indexed. Users often want to look up a node by its IRI. Without a schema index, Neo4j scans all the nodes to return the matching one.

Solution

Tag all the nodes in the graph with a generic label, so a schema index can be added for this generic label and a property, for example the iri.
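The reason a generic label is needed can be illustrated with a toy model in Python. This is not how Neo4j implements indexes; `tag_all_nodes` and `build_index` are hypothetical helpers that only mimic the rule that an index covers nodes having both a given label and a given property.

```python
def tag_all_nodes(nodes, label="Node"):
    """Add the generic label to every node, in place."""
    for node in nodes:
        node["labels"].add(label)


def build_index(nodes, label, prop):
    """Map each property value to the nodes that carry both label and property."""
    index = {}
    for node in nodes:
        if label in node["labels"] and prop in node["properties"]:
            index.setdefault(node["properties"][prop], []).append(node)
    return index
```

Before tagging, an index on the generic label is empty because no node carries that label; after tagging, every node with an iri property can be found by a single dictionary lookup instead of a scan over all nodes.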

Usage

Add this line in the yaml configuration file to enable the all nodes labeler:

allNodesLabel: Node

Anonymous node tagger

Motivation

Some users represent blank nodes as IRIs using skolemization patterns and still want to keep the 'anonymous' Neo4j label on these nodes.

Solution

Nodes that need to be tagged with the OWL anonymous label are expected to be pre-marked with a node property. The post-processor looks for nodes with this property and labels them.
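A minimal sketch of that pass, assuming nodes are modeled as dictionaries and that the mere presence of the marker property is enough to trigger the label (whether the real post-processor also inspects the property value is not stated here). The function name `tag_anonymous` is hypothetical.

```python
def tag_anonymous(nodes, marker_property, label="anonymous"):
    """Label every node that carries the marker property; return how many were tagged."""
    tagged = 0
    for node in nodes:
        # Assumption: any node having the property is treated as anonymous.
        if marker_property in node["properties"]:
            node["labels"].add(label)
            tagged += 1
    return tagged


marker = "https://monarchinitiative.org/MONARCH_anonymous"
nodes = [
    {"labels": set(), "properties": {marker: True}},
    {"labels": set(), "properties": {"iri": "http://example.org/named"}},
]
count = tag_anonymous(nodes, marker)
```

Only the first node ends up with the anonymous label; the second node, which lacks the marker property, is left unchanged.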

Usage

Add this line in the yaml configuration file to enable the anonymous node tagger:

anonymousNodeProperty: https://monarchinitiative.org/MONARCH_anonymous