BioPAX Validation rules and best practices
Those marked with * require more discussion.
1. Domain, Range and Cardinality
1.1 During OWL import, the validator converts most cardinality errors into "syntax errors" and does not stop (greedy behavior).
1.2 Cardinality rules that check min/max/exact (incl. 'functional') OWL property restrictions.
1.2.2 Complex cellular location (it is convenient for users if the components of the complex also have the matching cellular location.)
1.2.3 Rules that check inverse-functional property restriction.
1.3 Avoid instances of highly abstract ontology classes; i.e., more specific sub-classes should be used instead of Conversion, PhysicalEntity, and EntityReference.
1.4 Prevent unnecessarily generic interactions, e.g., complex assemblies represented as conversions.
1.5 Dangling RDF IDs (BioPAX elements)
1.6 References/property values to objects (IDs) that are absent from the file/model (dangling).
1.7 ModificationFeature should have a value for featureLocation (warning)
1.8 PhysicalEntity (and subclasses) should have displayName (warning) see 2.24
1.9 ControlledVocabulary (and subclasses) should have a value for the term property (warning)
1.10 Catalysis.controlType - only "ACTIVATION"
1.11 TemplateReactionRegulation.controlType - either "ACTIVATION" or "INHIBITION"
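Rules 1.10 and 1.11 can be sketched as a simple table lookup. This is a minimal Python illustration that does not use Paxtools; the `ALLOWED_CONTROL_TYPES` table and the `check_control_type` helper are hypothetical names, and treating an unset controlType as acceptable is an assumption of this sketch, not something the rules state.

```python
# Hypothetical table of permitted controlType values per BioPAX class
# (rules 1.10 and 1.11). Not part of Paxtools or the validator.
ALLOWED_CONTROL_TYPES = {
    "Catalysis": {"ACTIVATION"},
    "TemplateReactionRegulation": {"ACTIVATION", "INHIBITION"},
}


def check_control_type(class_name, control_type):
    """Return an error message, or None if the value is acceptable.

    Assumptions of this sketch: classes without a table entry are not
    restricted, and an unset (None) controlType is not reported.
    """
    allowed = ALLOWED_CONTROL_TYPES.get(class_name)
    if allowed is None or control_type is None:
        return None
    if control_type not in allowed:
        return (
            f"{class_name}.controlType must be one of {sorted(allowed)}, "
            f"got {control_type!r}"
        )
    return None
```

For example, `check_control_type("Catalysis", "INHIBITION")` returns a message, while the same value on a TemplateReactionRegulation passes.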
2. Proper Semantics
(use of properties, sub-properties, chemical semantics, etc.)
2.1 Conversion shouldn't have participants, but instead should have participant sub-properties, like left and right, filled. (fortunately, Paxtools throws UnsupportedOperationException on attempts to update participant directly, which the validator should report)
2.2 PathwayStep (in a Pathway.pathwayOrder) - if listed in the nextStep property of another PathwayStep, the participants of their stepProcess-es must have a non-empty intersection (a Process is an interaction or pathway).
2.3 PathwayStep.stepProcess (or stepConversion) - the same component (Process) must be found in the corresponding Pathway.pathwayComponent property (may use Process.isPathwayComponentOf() to check).
2.4 Check for long display names (property: displayName): e.g., more than 25 characters is not good because it doesn't display well on a computer screen (best practice). For Provenance, the limit is 50.
2.5 Note: ComplexAssembly is always reversible? We can't say, it depends on the complex
2.6 If BiochemicalPathwayStep is used, its stepDirection must be set (otherwise, consider using PathwayStep instead), and step process's catalysisDirection must be the same or empty value, and step conversion's conversionDirection - the same or REVERSIBLE.
2.7 If stepDirection of BiochemicalPathwayStep is not empty then ‘direction’ of the Catalysis instance is either blank, "REVERSIBLE" or the same. merged to 2.6
2.8 Pathway.organism - multi-organism networks must specify the organism on the component physical entities (within interactions) and pathways, instead of using this property. What to do if that is not the case? Three options: skip, delete this one, or overwrite the nested organism properties with this value. Warn if a pathway and its component have different (not null) 'organism' values.
2.9 dataSource - should be used to describe the latest source of the data, not all (set cardinality=1 constraint).
2.10 Warn on usage of the OWL documentation element rdfs:comment in Entities; the Entity.comment property should be used instead. Auto-replace. (In Paxtools (SimpleReader only), rdfs:comment is converted to bp:comment, and the warning propagates to the validator.)
2.11 BiochemicalPathwayStep - only one conversion interaction can be ordered at a time, though multiple catalysis or modulation instances can be part of one step. Conversion usage is restricted to the stepConversion property (BP3 OWL has been updated accordingly by Emek.)
2.12 BiochemicalPathwayStep - intended for use when the direction of the reaction is not clear and should not be used otherwise (use PathwayStep then; BioPAX L3 may change the implementation of how direction is stored, pending resolution of an open issue). merged into 2.6
2.13 Conversion - check entity conservation, e.g. entities, like proteins should be the same on both sides of a conversion, just in different states (Transport.LEFT participants are the same as Transport.RIGHT, but in different compartment.)
2.14 Check that if a direction is specified in BiochemicalPathwayStep.stepDirection, then the corresponding conversionDirection property (of the Conversion) in the stepConversion property is specified as "REVERSIBLE". merged into 2.6
2.15 * Post-translational modifications match the sequence, e.g. phosphorylation modification position actually happens on a residue that can be phosphorylated e.g. STY residues. If sequence is not present in the protein reference, give a warning that this rule could not be checked. (Version 2.0 of rule: Normalize sequence i.e. if sequence is not there, use the sequence ID to get it from an online database)
2.16 Directions to be consistent, i.e. that a reaction that goes left to right is not specified as going in the opposite direction in the BiochemicalPathwayStep merged into 2.6
2.17 * intraMolecular must agree for connected BindingFeatures.
2.18 Physical entities that reference the same EntityReference must be in different states (i.e., features on the PhysicalEntity can't be exactly the same on another PhysicalEntity; e.g., two proteins that reference the same ProteinReference must have different phosphorylation states).
2.19 Warning if participants of a single complex span multiple cellular compartments. Actually check for valid compartment spanning e.g. nucleus does not touch extracellular space, so you can't have a complex that spans these compartments without having participants also in the intermediate compartments. Perhaps, GO "Cellular Component" could be used to automatically check this, since it has 'part of' relationships between compartments. [thinking...] *
2.20 ComplexAssembly has at least one complex on one side.
2.21 Warning if participants of one interaction span multiple cellular compartments. (* now just CL terms are compared, but the ontology tree may be required).
2.22 BiochemicalReaction (base) - the participants on one side don't change compartments on the other side (otherwise it should be a BiochemicalReactionWithTransport).
2.23 EntityReference (its subclasses) should have at least one UnificationXref (best practice)
2.24 Missing names in ProteinReference - it would be useful to set at least the protein displayName in ProteinReference. Encourage the use of displayName for the following things: Protein, ProteinReference, SmallMolecule*, Dna*, Rna*, etc. (all classes that have this property). This is mostly for convenience (should be warning).
2.25 Usually complexes have more than one component, unless stoichiometry greater than one was set on the single component, as would be done for a homodimer.
2.26 Warn on typical 'NULL' values in any field. The typical null values that we sometimes encounter are 'NIL', 'nil' (case insensitive), 0, -1 for certain fields where that doesn't make sense, like database IDs. The validation rule uses a dictionary of frequently seen 'null' values and corresponding fields that should be checked.
2.27 Warn if two physical entities of different kinds (i.e., having different entity references) have a name in common. [not sure if we need this]
2.28 * memberPhysicalEntity – use of this property is not recommended (warning). It is only defined to support legacy data in certain databases. In general, the EntityReference class should be used to create generic groups of physical entities, however there are some cases where this is not possible (e.g., generic Complex).
2.29 A feature listed in notFeature or feature properties should be also listed in the corresponding EntityReference.entityFeature list.
2.30 * The stoichiometric coefficients of the left and the right sides of a Conversion should match to each other if it is not a Degradation (TODO: make clear...)
2.31 A Conversion should be of type ComplexAssembly if a complex is formed and the PhysicalEntity participants are not modified during the process.
2.32 If a PhysicalEntity participates in a Conversion as a component of a Complex, then it should have a BindingFeature for clarity; if not, i.e. if it participates in a reaction as a separate molecule, then two PhysicalEntities (one as a participant and one as a Complex component) should be used.
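The direction checks merged into rule 2.6 can be sketched as follows. This is a minimal Python illustration over plain values rather than Paxtools objects; the function name and the direction strings ("LEFT_TO_RIGHT", "RIGHT_TO_LEFT") are placeholders chosen for this sketch. It follows 2.6 as written, so an empty conversionDirection is reported (an assumption, since the rule only names "the same or REVERSIBLE").

```python
def check_biochemical_pathway_step(step_direction, catalysis_directions,
                                   conversion_direction):
    """Return a list of problems for one BiochemicalPathwayStep (rule 2.6).

    step_direction:       stepDirection value, or None if empty
    catalysis_directions: catalysisDirection of each Catalysis in the step
                          (None means empty)
    conversion_direction: conversionDirection of the step conversion
    """
    if step_direction is None:
        return ["stepDirection is not set; consider using PathwayStep instead"]

    problems = []
    # catalysisDirection must be empty or equal to stepDirection.
    for direction in catalysis_directions:
        if direction is not None and direction != step_direction:
            problems.append(
                f"catalysisDirection {direction!r} differs from "
                f"stepDirection {step_direction!r}"
            )
    # conversionDirection must equal stepDirection or be REVERSIBLE.
    if conversion_direction not in (step_direction, "REVERSIBLE"):
        problems.append(
            f"conversionDirection {conversion_direction!r} is neither "
            f"{step_direction!r} nor 'REVERSIBLE'"
        )
    return problems
```

A step with stepDirection "LEFT_TO_RIGHT", one catalysis with an empty direction, and a REVERSIBLE conversion yields an empty list.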
3. Controlled Vocabularies
3.1 CellularLocationVocabulary -> GO:0005575 cellular component. If a location is unknown then no term should be specified (do not use the GO term for 'cellular component unknown', GO:0008372). As this is the old/synonymous term name for GO:0005575, this actually means: do not use the term "cellular_component" itself, so only its children are allowed.
3.2 EvidenceCodeVocabulary -> BioCyc Evidence Ontology probably won't be part of MI because the groups unfortunately don't collaborate. Also, because new evidence code CVs come out every now and then, we should probably make this rule 'warning' level, as in practice not everyone will want to use the CVs we recommend: children of MI "interaction detection method", "participant identification method", and "feature detection method"; the Pathway Tools Evidence Ontology may also be used. Multiple terms are allowed; at least one should be from a recommended CV.
3.4 * SequenceModificationVocabulary >>>
3.4.1 * For Protein, only covalent modifications at specific positions can be used, children of "biological feature". Update: use PSI MOD ontology here; terms under (also because PSI-MI "post translation modification" terms are now obsolete and should no longer be used).
3.4.2 For Rna and Dna, it should not be from MI or MOD. Which CV to use for RNA and DNA? Sequence Ontology has some CV terms for DNA, e.g., SO:0000306 is "methylated_base_feature", a child of "modified_base_site". We will continue searching for more CVs that are useful.
3.5 SequenceRegionVocabulary -> controlled vocabulary of sequence regions, such as InterPro or Sequence Ontology (SO);
3.6 CellVocabulary -> Cell Type Ontology (CL).
3.7 InteractionVocabulary -> MI: "interaction type" branch.
3.8 RelationshipTypeVocabulary -> MI: "cross reference type" branch.
3.9 TissueVocabulary -> BRENDA Tissue Ontology (BTO).
3.10 PhenotypeVocabulary - this is only the type, not the value (e.g., for a synthetic lethal interaction, the phenotype is "viability", specified by ID PATO:0000169, not the value, specified by ID PATO:0000718, "lethal (sensu genetics)"). A patoData property may be present. Allowed: the PATO:0001995 "organismal quality" term and its children; PATO:0000169 "viability" but not its children. [TODO: more?..]
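Most rules in this section reduce to "the term must sit under a given root term", with variations such as "root excluded" (3.1) or "root only, no children" (3.10). A minimal Python sketch of that check is given below. The `ancestors` and `term_allowed` helpers are hypothetical, and the ontology is passed in as a plain child-to-parents map; the `EX:` identifiers in the usage note are made-up placeholders, not real GO or PATO terms.

```python
def ancestors(term, parents):
    """Return all proper ancestors of `term`.

    `parents` maps a term ID to the list of its direct parent term IDs.
    """
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def term_allowed(term, root, parents, allow_root=False, allow_children=True):
    """Check a CV term against a rule of the form 'root and/or its children'."""
    if term == root:
        return allow_root
    return allow_children and root in ancestors(term, parents)
```

With a toy map `{"EX:a": ["GO:0005575"], "EX:b": ["EX:a"]}`, the call `term_allowed("EX:b", "GO:0005575", parents)` accepts the descendant, while `term_allowed("GO:0005575", "GO:0005575", parents)` rejects the root itself, as rule 3.1 requires. A real implementation would load the ontology (for example from OBO files) instead of a hand-written map.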
BioPAX Validator greedily collects problems that occur while the model is read and built by Paxtools and reports them as "syntax errors". Such errors should be tackled first, because they are often the root cause of other (post-model, semantic) errors and warnings. In other words, one would see an exception (failure) or get into less apparent trouble when using such BioPAX data in another application. For example, Paxtools would fail and exit, while Protege would fail to open the file or simply set some of the values to null.
4.1 Deal with syntax errors in XML, RDF, OWL. This is a problem for users who don't use Paxtools, Jena or an XML-aware API to create BioPAX OWL files, or who edit an OWL file in a text editor. (The validator reports errors raised by Paxtools and the Jena parser while loading pathway data; the OWL API could also be used. However, it may be difficult to collect all errors in one pass when loading through Paxtools/Jena.)
4.2 * Some support (e.g. syntax check, CVs, Xrefs) for Levels 1 and 2. [done, but more rules are to be implemented]
4.3 A global BioPAX element's identifier is a URI (the value of rdf:about, or xml:base + rdf:ID); it must be a valid URI (RFC 2396). For some UtilityClass objects, such as EntityReference (and its subclasses), ControlledVocabulary (and subclasses) and PublicationXref (but not for other Xref types), as well as for Pathways where stable Reactome, KEGG, etc. IDs exist, prefer using Identifiers.org based standard URIs.
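As a rough illustration of rule 4.3, the Python sketch below builds an element URI from either rdf:about or xml:base + rdf:ID and applies a coarse sanity check. Both helper names are hypothetical. The check is only an approximation (a scheme is present and there is no whitespace); it is not a full RFC 2396 parser, and joining the base and the ID with "#" follows the usual RDF/XML reading of rdf:ID rather than anything stated in this document.

```python
from urllib.parse import urlparse


def element_uri(xml_base, rdf_about=None, rdf_id=None):
    """Build the element URI from rdf:about, or from xml:base + rdf:ID."""
    if rdf_about is not None:
        return rdf_about
    # Avoid a double '#' when the base already ends with one.
    separator = "" if xml_base.endswith("#") else "#"
    return f"{xml_base}{separator}{rdf_id}"


def looks_like_absolute_uri(uri):
    """Coarse check: non-empty, no whitespace, has a scheme and some content."""
    if not uri or any(ch.isspace() for ch in uri):
        return False
    parsed = urlparse(uri)
    return bool(parsed.scheme) and bool(parsed.netloc or parsed.path)
```

For instance, `element_uri("http://example.org/base", rdf_id="Protein1")` gives `http://example.org/base#Protein1`, which passes the check, whereas a bare `Protein1` or a value containing a space does not.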
5. Xref Rules
5.1 Use database names from MIRIAM where possible for 'db' property (it also contains regexps to validate ID format), otherwise, use children terms of MI "database citation". PSI-MI group mentioned that they are working with MIRIAM group to standardize database names CV. Warn if an xref (or specific types of xrefs) does not contain either 'db' or 'id' values, or both.
5.2 db: create a list of commonly misspelled database names which we can recognize and automatically correct. MIRIAM may have errors (e.g., a wrong regexp for UniProt, now FIXED), or no synonyms for some databases (then unofficial synonyms and those from MI can be used).
5.3 id: use MIRIAM to check ID format.
5.4 * id: keep the version separate from the identifier (e.g., CAA61361.2 => id=CAA61361, idVersion=2). For CV IDs, keep the namespace with the ID, like "GO:01234" not "01234"; this is naturally enforced by applying a regexp, but correction of such errors can also be implemented. (There is no universal way to recognize a "version" in any type of bio identifier.)
5.5 UnificationXref applicability (i.e., limit its usage, - list BioPAX types and what their unification xrefs can/cannot refer to, e.g.: Dna or Rna - NOT UniProt; PhysicalEntity - NOT Gene Ontology; etc.):
5.5.1 UniProt - only for proteins (protein references)
5.5.2 GO IDs shouldn't be in unification xref of physical entities.
5.5.3 BioSource - use only "Taxonomy"
5.5.4 Interaction - not PSI-MI
5.5.5 ProteinReference - not Entrez Gene, not OMIM...
5.5.6 Provenance - is a data source and, normally, should not use any unification xrefs with the 'db' property equal to 'UniProt', 'RefSeq', 'Go', or 'MI', etc. (i.e., known bio resources); but, e.g., x.db=Miriam, x.id=Reactome (or http://identifiers.org/reactome) may be correct.
5.6 * Use an "ID mapper" to check/advise that the best, non-deprecated values are used (this might make the application heavy).
5.7 A db="Entrez Gene" ("NCBI Gene"), "HGNC", "HGNC Symbol", "OMIM", etc., Xref should be RelationshipXref type if attached to a ProteinReference. Only a reference to the same object in another database should be a UnificationXref, and a gene is not a protein.
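Rule 5.4 (separating the version from the identifier) can be sketched as below. Since there is no universal way to recognize a version, this minimal Python example only splits a trailing ".digits" suffix for databases listed in `VERSIONED_DBS`; that set, its contents, and the `split_version` helper are hypothetical and would have to be curated per database.

```python
import re

# Trailing ".<digits>" is treated as a version suffix.
_VERSION_SUFFIX = re.compile(r"^(?P<id>.+?)\.(?P<version>\d+)$")

# Hypothetical list of databases whose IDs are known to carry such a suffix.
VERSIONED_DBS = {"genbank", "refseq"}


def split_version(db, xref_id):
    """Return (id, idVersion); idVersion is None when nothing was split."""
    if db.lower() not in VERSIONED_DBS:
        return xref_id, None
    match = _VERSION_SUFFIX.match(xref_id)
    if match is None:
        return xref_id, None
    return match.group("id"), match.group("version")
```

With these assumptions, `split_version("GenBank", "CAA61361.2")` returns `("CAA61361", "2")`, while identifiers from databases outside the list are returned unchanged.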
6. Unnecessary Duplication
6.1 Duplicate Protein instances - only create one Protein instance for each protein state. E.g., there are 2 BLK protein instances that are identical. It would be ok to create 2 BLKs if they were different states of the same protein 'BLK', such as BLK-nucleus and BLK-cytoplasm, or BLK-phosphorylated@T77. Each of these states would reference the same protein reference instance.
6.2 Duplicate names rule: if you add a name to standard name or display name, don't add it to name (but in the BioPAX Level 3 the opposite is advised)
6.3 Warn when utility classes are duplicated. Users are allowed to reuse instances of classes like RelationshipTypeVocabulary if needed. Sometimes they may create a lot of these, but it is unnecessary and these take up extra space in the file. This is an example of a rule that could have a 'fix the data' option which would remove all of the redundant instances and update the references accordingly to point to the single remaining instance.
6.3.1 Only create one ProteinReference instance for each type of protein, as determined by UnificationXref, e.g. the one having db="HPRD" or "UniProt".
6.3.2 Different unification xrefs should point to different resources (should not be semantically equivalent). (IR: However, BioPAX spec. allows this)
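The 'fix the data' option described in 6.3 could work roughly as in the following Python sketch. It models utility-class instances as an ID-to-content-key map and references as (source, property, target) triples; the `merge_duplicates` helper and this data layout are hypothetical, and the sketch assumes that reference sources are not themselves among the merged duplicates.

```python
def merge_duplicates(objects, references):
    """Collapse instances with equal content and redirect references.

    objects:    {object_id: content_key}, where equal content_key values
                mean the instances are semantically identical
    references: list of (source_id, property_name, target_id)
    Returns (kept_objects, rewritten_references).
    """
    canonical = {}    # content_key -> ID of the instance that is kept
    replacement = {}  # every object ID -> ID of its kept equivalent
    for object_id in sorted(objects):  # sorted for a deterministic choice
        key = objects[object_id]
        replacement[object_id] = canonical.setdefault(key, object_id)

    kept = {oid: key for oid, key in objects.items() if replacement[oid] == oid}
    rewritten = [
        (source, prop, replacement.get(target, target))
        for source, prop, target in references
    ]
    return kept, rewritten
```

If `cv1` and `cv2` carry the same content key, only `cv1` is kept and any reference that pointed to `cv2` is redirected to `cv1`.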
7. Topology and Structure
7.1 Pathway inclusions should be acyclic (note: biopax-validator reports as warning)
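The acyclicity check in 7.1 amounts to cycle detection in the pathway-inclusion graph. A minimal Python sketch using depth-first search is shown below; the `find_pathway_cycle` helper and the plain dictionary input (pathway ID to the IDs of the pathways it includes) are assumptions of this example, not the validator's actual implementation.

```python
def find_pathway_cycle(components):
    """Return one inclusion cycle as a list of IDs, or None if acyclic.

    components: {pathway_id: [included_pathway_id, ...]}
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}  # IDs missing from this map have not been visited yet
    path = []   # current DFS path, used to report the cycle

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for child in components.get(node, []):
            if state.get(child) == IN_PROGRESS:
                # Back edge: the cycle runs from `child` to `node` and back.
                return path[path.index(child):] + [child]
            if child not in state:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in components:
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For `{"A": ["B"], "B": ["C"], "C": ["A"]}` the function reports the cycle A, B, C, A; a pathway shared by two parents (a diamond) is not a cycle and yields None. The recursion is fine for a sketch, but very deep inclusion chains would call for an iterative version.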
* 10. Advanced Validation and Analysis
(these rules may be fuzzy, depend on external mappers and validators, and difficult to check)
10.1 sequence is correct
10.2 entity feature is consistent with location, sequence, etc.
10.3 multiple unification xrefs are about the same thing...
10.3.1 unification xrefs are of the same organism, which matches the one specified by organism property (if applicable)
10.4 xref.id is actually present in the database (xref.db)