# OAK validate-definitions command

This notebook is intended as a supplement to the [main OAK CLI docs](https://incatools.github.io/ontology-access-kit/cli.html).

This notebook provides examples for the `validate-definitions` command.
This forms part of a suite of *validate* commands.
    
## Help Option

You can get help on any OAK command using the `--help` option:

In [16]:
!runoak validate-definitions --help

Usage: runoak validate-definitions [OPTIONS] [TERMS]...

  Checks presence and structure of text definitions.

  To run:

      runoak validate-definitions -i db/uberon.db -o results.tsv

  By default this will apply basic text mining of text definitions to check
  against machine actionable OBO text definition guideline rules. This can
  result in an initial lag - to skip this, and ONLY perform checks for
  *presence* of definitions, use --skip-text-annotation:

  Example: -------

      runoak validate-definitions -i db/uberon.db --skip-text-annotation

  Like most OAK commands, this accepts lists of terms or term queries as
  arguments. You can pass in a CURIE list to selectively validate individual
  classes

  Example: -------

       runoak validate-definitions -i db/cl.db CL:0002053

  Only on CL identifiers:

      runoak validate-definitions -i db/cl.db i^CL:

  Only on neuron hierarchy:

      runoak validate-definitions -i db/cl.db .desc//p=i neuron

  Output format:

  This

## Example: Validation over Test Ontology

To illustrate this command we will use a deliberately altered version of a subset of GO.

We will query the subset of terms that are descendants of cellular component, using the query `.desc//p=i "cellular_component"`:

In [17]:
!runoak -i simpleobo:input/validate-defs-test.obo validate-definitions -C input/validate-definition-conf.yaml .desc//p=i "cellular_component" -o output/validate-definitions.output.tsv

The output is a TSV file with a summary of the issues found.

We can load this into a pandas dataframe for further analysis. This also has the advantage of
displaying tables nicely in Jupyter notebooks such as this one.

If you were running this on the command line, you might prefer to use your own TSV processing tools,
or simply load the file into Google Sheets.

In [18]:
import pandas as pd
df = pd.read_csv("output/validate-definitions.output.tsv", sep="\t")
df

Unnamed: 0,type,subject,subject_label,severity,instantiates,predicate,object,object_str,source,info
0,oaklib.om:DCC#S3,GO:0043231,intracellular membrane-bounded organelle,WARNING,,IAO:0000115,,Organized structure of distinctive morphology ...,,Cannot parse genus and differentia
1,oaklib.om:DCC#S11,GO:0043231,intracellular membrane-bounded organelle,,,IAO:0000115,,,,Logical definition element not found in text: ...
2,oaklib.om:DCC#S11,GO:0043231,intracellular membrane-bounded organelle,,,IAO:0000115,,,,Logical definition element not found in text: ...
3,oaklib.om:DCC#S3,GO:0099568,cytoplasmic region,WARNING,,IAO:0000115,,Any (proper) part of the cytoplasm of a single...,,Cannot parse genus and differentia
4,oaklib.om:DCC#S3,GO:0099738,cell cortex region,,,IAO:0000115,,complete extent of cell cortex,,Did not match whole text: cell cortex < comple...
5,oaklib.om:DCC#S11,GO:0099738,cell cortex region,,,IAO:0000115,,underlies some some region of the plasma membrane,,"Wrong position, 'cell cortex' not in 'underlie..."
6,oaklib.om:DCC#S3,GO:0071944,cell periphery,WARNING,,IAO:0000115,,The part of a cell encompassing the cell corte...,,Cannot parse genus and differentia
7,oaklib.om:DCC#S11,GO:0031090,organelle membrane,,,IAO:0000115,,is one of the two lipid bilayers of an organel...,,Logical definition element not found in text: ...
8,oaklib.om:DCC#S3,GO:0043229,intracellular organelle,WARNING,,IAO:0000115,,Organized structure of distinctive morphology ...,,Cannot parse genus and differentia
9,oaklib.om:DCC#S11,GO:0043229,intracellular organelle,,,IAO:0000115,,,,Logical definition element not found in text: ...


The rows conform to ValidationResults in the [OAK ontology-metadata](https://w3id.org/oak/ontology-metadata/) data model.

The values of the type field are from the [DefinitionConstraintComponent](https://w3id.org/oak/ontology-metadata/DefinitionConstraintComponent) enumeration.

These are in turn modeled on the taxonomy from Seppälä, Ruttenberg, and Smith, [Guidelines for writing definitions in ontologies](https://philpapers.org/archive/SEPGFW.pdf).
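
The type identifiers appear in two spellings in this notebook's outputs: a CURIE-like form such as `oaklib.om:DCC#S3` in the TSV, and a URI form such as `https://w3id.org/oak/ontology-metadata/DCC.S20` in the YAML report. The sketch below is one way to reduce either spelling to its short rule code. The helper name `dcc_code` is hypothetical (it is not part of OAK), and the pattern assumes only the two spellings seen here.

```python
import re

# Matches the rule code that follows "DCC#" (TSV form) or "DCC." (URI form).
# Assumption: only these two spellings occur, as in this notebook's outputs.
_DCC_PATTERN = re.compile(r"DCC[#.](.+)$")


def dcc_code(type_id: str) -> str:
    """Return the trailing rule code (e.g. 'S3') of a DCC type identifier."""
    match = _DCC_PATTERN.search(type_id)
    if match is None:
        raise ValueError(f"Not a recognised DCC identifier: {type_id!r}")
    return match.group(1)


examples = [
    "oaklib.om:DCC#S3",
    "oaklib.om:DCC#S20.1",
    "https://w3id.org/oak/ontology-metadata/DCC.S20",
]
print([dcc_code(t) for t in examples])
```

Normalising the identifiers this way makes it easier to compare results across the TSV and YAML output formats.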

In [19]:
df["type"].unique()

array(['oaklib.om:DCC#S3', 'oaklib.om:DCC#S11', 'oaklib.om:DCC#Any',
       'oaklib.om:DCC#S0', 'oaklib.om:DCC#S7', 'oaklib.om:DCC#S1',
       'oaklib.om:DCC#S20.1', 'oaklib.om:DCC#S20.2'], dtype=object)

In [20]:
df.groupby("type").size().reset_index(name='counts')

Unnamed: 0,type,counts
0,oaklib.om:DCC#Any,6
1,oaklib.om:DCC#S0,1
2,oaklib.om:DCC#S1,2
3,oaklib.om:DCC#S11,10
4,oaklib.om:DCC#S20.1,1
5,oaklib.om:DCC#S20.2,1
6,oaklib.om:DCC#S3,14
7,oaklib.om:DCC#S7,1
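
The bare codes are hard to read at a glance. As a rough aid, the sketch below attaches informal labels to the counts above. The labels are paraphrased from the `info` messages and section headings in this notebook; they are an assumption for illustration, not the official labels of the enumeration, and `summarise` is a hypothetical helper.

```python
# Informal labels paraphrased from this notebook's messages and headings.
# Assumption: these are NOT the official enumeration labels.
INFORMAL_LABELS = {
    "S0": "missing text definition",
    "S1": "definiendum at start of definition",
    "S3": "not in genus-differentia form",
    "S7": "circular definition",
    "S11": "text does not match logical definition",
    "S20.1": "referenced publication not found",
    "S20.2": "referenced publication is retracted",
}

# Counts copied from the groupby output above.
COUNTS = {
    "oaklib.om:DCC#Any": 6,
    "oaklib.om:DCC#S0": 1,
    "oaklib.om:DCC#S1": 2,
    "oaklib.om:DCC#S11": 10,
    "oaklib.om:DCC#S20.1": 1,
    "oaklib.om:DCC#S20.2": 1,
    "oaklib.om:DCC#S3": 14,
    "oaklib.om:DCC#S7": 1,
}


def summarise(counts):
    """Return (count, code, label) rows, most frequent first."""
    rows = []
    for type_id, n in counts.items():
        code = type_id.split("#", 1)[1]
        label = INFORMAL_LABELS.get(code, "(not covered in this notebook)")
        rows.append((n, code, label))
    return sorted(rows, key=lambda row: (-row[0], row[1]))


for n, code, label in summarise(COUNTS):
    print(f"{n:>3}  {code:<6} {label}")
```

With the dataframe at hand, the same lookup could be applied directly, for example with `df["type"].str.split("#").str[1].map(INFORMAL_LABELS)`.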


Next, we'll keep only the more informative columns:

In [21]:
df = df[["type", "subject", "subject_label", "object_str", "info"]]
df

Unnamed: 0,type,subject,subject_label,object_str,info
0,oaklib.om:DCC#S3,GO:0043231,intracellular membrane-bounded organelle,Organized structure of distinctive morphology ...,Cannot parse genus and differentia
1,oaklib.om:DCC#S11,GO:0043231,intracellular membrane-bounded organelle,,Logical definition element not found in text: ...
2,oaklib.om:DCC#S11,GO:0043231,intracellular membrane-bounded organelle,,Logical definition element not found in text: ...
3,oaklib.om:DCC#S3,GO:0099568,cytoplasmic region,Any (proper) part of the cytoplasm of a single...,Cannot parse genus and differentia
4,oaklib.om:DCC#S3,GO:0099738,cell cortex region,complete extent of cell cortex,Did not match whole text: cell cortex < comple...
5,oaklib.om:DCC#S11,GO:0099738,cell cortex region,underlies some some region of the plasma membrane,"Wrong position, 'cell cortex' not in 'underlie..."
6,oaklib.om:DCC#S3,GO:0071944,cell periphery,The part of a cell encompassing the cell corte...,Cannot parse genus and differentia
7,oaklib.om:DCC#S11,GO:0031090,organelle membrane,is one of the two lipid bilayers of an organel...,Logical definition element not found in text: ...
8,oaklib.om:DCC#S3,GO:0043229,intracellular organelle,Organized structure of distinctive morphology ...,Cannot parse genus and differentia
9,oaklib.om:DCC#S11,GO:0043229,intracellular organelle,,Logical definition element not found in text: ...


## Missing Definitions

The most trivial way to fail a definition check is to not include a definition at all. We can list all the terms with missing definitions:


In [22]:
df[df["type"] == "oaklib.om:DCC#S0"]


Unnamed: 0,type,subject,subject_label,object_str,info
14,oaklib.om:DCC#S0,GO:0012505,endomembrane system,,Missing text definition


Of course, in the real ontology this term has a definition; it is missing only in the deliberately altered test file.

## Non genus-differentia structure

The OAK `validate-definitions` command follows Seppälä, Ruttenberg, and Smith ([SRS](https://philpapers.org/archive/SEPGFW.pdf)) and assumes that good definitions have a genus-differentia structure.

We can see the ones that fail this (S3):

In [23]:
df[df["type"] == "oaklib.om:DCC#S3"]

Unnamed: 0,type,subject,subject_label,object_str,info
0,oaklib.om:DCC#S3,GO:0043231,intracellular membrane-bounded organelle,Organized structure of distinctive morphology ...,Cannot parse genus and differentia
3,oaklib.om:DCC#S3,GO:0099568,cytoplasmic region,Any (proper) part of the cytoplasm of a single...,Cannot parse genus and differentia
4,oaklib.om:DCC#S3,GO:0099738,cell cortex region,complete extent of cell cortex,Did not match whole text: cell cortex < comple...
6,oaklib.om:DCC#S3,GO:0071944,cell periphery,The part of a cell encompassing the cell corte...,Cannot parse genus and differentia
8,oaklib.om:DCC#S3,GO:0043229,intracellular organelle,Organized structure of distinctive morphology ...,Cannot parse genus and differentia
11,oaklib.om:DCC#S3,GO:0031967,organelle envelope,A double membrane structure enclosing an organ...,Cannot parse genus and differentia
12,oaklib.om:DCC#S3,GO:0031975,envelope,A multilayered structure surrounding all or pa...,Cannot parse genus and differentia
15,oaklib.om:DCC#S3,GO:0005622,intracellular anatomical structure,A component of a cell contained within (but no...,Cannot parse genus and differentia
16,oaklib.om:DCC#S3,GO:9999998,fake term for testing pmid type,fake definition to test retracted typo in refe...,Cannot parse genus and differentia
17,oaklib.om:DCC#S3,GO:0043227,membrane-bounded organelle,Organized structure of distinctive morphology ...,Cannot parse genus and differentia


Many of these are actual definitions rather than ones manipulated for test purposes.

There is room for valid disagreement about whether rewriting some of these following genus-differentia form would improve things for either users or annotators. Arguably at least the subtypes of organelle could simply state how they are differentiated from organelles in general rather than repeating the somewhat wordy _"Organized structure of distinctive morphology..."_
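
To make the idea of genus-differentia structure concrete, here is a deliberately naive sketch that splits a definition of the shape "A *genus* that *differentia*". It is not OAK's parser, which applies its own text-mining rules; the regular expression, the helper name `split_genus_differentia`, and the first example sentence are all made up for illustration.

```python
import re

# Toy pattern: an article, a genus phrase, a linking word, then the differentia.
# Assumption: real definitions are far more varied than this pattern allows.
GENUS_DIFFERENTIA = re.compile(
    r"^(?:an?|the|any)\s+(?P<genus>.+?)\s+"
    r"(?:that|which|in which|whose)\s+(?P<differentia>.+)$",
    re.IGNORECASE,
)


def split_genus_differentia(definition):
    """Return (genus, differentia), or None if the toy pattern does not match."""
    text = definition.strip().rstrip(".")
    match = GENUS_DIFFERENTIA.match(text)
    if match is None:
        return None
    return match.group("genus"), match.group("differentia")


# Invented example in genus-differentia form.
print(split_genus_differentia("An organelle that is bounded by a membrane."))
# Definition text taken from the S3 results above.
print(split_genus_differentia("complete extent of cell cortex"))
```

The second call returns `None` because the text has no leading article or linking word, which mirrors in spirit (though not in detail) why such a definition is flagged under S3.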

## Circular definitions

In [24]:
df[df["type"] == "oaklib.om:DCC#S7"]

Unnamed: 0,type,subject,subject_label,object_str,info
21,oaklib.om:DCC#S7,GO:0009579,thylakoid,The structure in a plant cell that is known as...,"Circular, thylakoid (GO:0009579 in definition"
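
A definition is reported as circular when the term being defined appears in its own definition, as in the thylakoid row above. A minimal sketch of such a check, assuming a simple case-insensitive whole-phrase match (OAK's actual rule may differ, and `is_circular` is a hypothetical helper), could look like this. The definition strings are invented examples.

```python
import re


def is_circular(label, definition):
    """True if the term's own label occurs as a whole phrase in its definition."""
    pattern = r"\b" + re.escape(label) + r"\b"
    return re.search(pattern, definition, re.IGNORECASE) is not None


# Invented definition texts, used only to exercise the check.
print(is_circular("thylakoid", "An example definition that mentions thylakoid by name."))
print(is_circular("thylakoid", "An example definition that avoids the label."))
```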


## Not following convention

In [25]:
df[df["type"] == "oaklib.om:DCC#S1"]

Unnamed: 0,type,subject,subject_label,object_str,info
29,oaklib.om:DCC#S1,GO:0005773,vacuole,,Definiendum should not appear at the start
31,oaklib.om:DCC#S1,GO:0005737,cytoplasm,,Definiendum should not appear at the start
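
The message "Definiendum should not appear at the start" means the definition opens with the term being defined. A rough sketch of this kind of check, assuming that a leading article should be ignored (this is an illustration, not OAK's implementation, and `starts_with_definiendum` is a hypothetical helper), is shown below with invented definition texts.

```python
import re


def starts_with_definiendum(label, definition):
    """True if the definition begins with the term's label, ignoring a leading article."""
    text = definition.strip()
    # Drop a leading "a", "an" or "the" before comparing.
    text = re.sub(r"^(?:an?|the)\s+", "", text, flags=re.IGNORECASE)
    return text.lower().startswith(label.lower())


# Invented definition texts, used only to exercise the check.
print(starts_with_definiendum("vacuole", "A vacuole is a hypothetical example structure."))
print(starts_with_definiendum("vacuole", "A hypothetical example structure."))
```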


## Definition Reference Issues

### Typos in PMIDs


In [26]:
df[df["type"] == "oaklib.om:DCC#S20.1"]


Unnamed: 0,type,subject,subject_label,object_str,info
34,oaklib.om:DCC#S20.1,GO:9999998,fake term for testing pmid type,fake definition to test retracted typo in refe...,publication not found: PMID:9999999999999


### Retracted publications

In [27]:
df[df["type"] == "oaklib.om:DCC#S20.2"]


Unnamed: 0,type,subject,subject_label,object_str,info
35,oaklib.om:DCC#S20.2,GO:9999999,fake term for testing retraction,,publication is retracted: A role for plasma tr...


# Using LLMs to validate definitions

For this example we will use an LLM to validate this GO catalytic activity:

```yaml
[Term]
id: GO:0000010
name: trans-hexaprenyltranstransferase activity
namespace: molecular_function
alt_id: GO:0036422
def: "Catalysis of the reaction: (2E,6E)-farnesyl diphosphate + 4 isopentenyl diphosphate = 4 diphosphate + all-trans-heptaprenyl diphosphate." [PMID:9708911, RHEA:27794]
synonym: "all-trans-heptaprenyl-diphosphate synthase activity" RELATED [EC:2.5.1.30]
synonym: "HepPP synthase activity" RELATED [EC:2.5.1.30]
synonym: "heptaprenyl diphosphate synthase activity" RELATED []
synonym: "heptaprenyl pyrophosphate synthase activity" RELATED [EC:2.5.1.30]
synonym: "heptaprenyl pyrophosphate synthetase activity" RELATED [EC:2.5.1.30]
xref: EC:2.5.1.30
xref: MetaCyc:TRANS-HEXAPRENYLTRANSTRANSFERASE-RXN
xref: RHEA:27794
```

The definition cites two references:

  - the publication [PMID:9708911](https://pubmed.ncbi.nlm.nih.gov/9708911/)
  - the RHEA reaction [RHEA:27794](https://www.rhea-db.org/reaction?id=27794)
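
These references are the bracketed list at the end of the `def:` line in the stanza above. As an illustration, the following sketch pulls the definition text and its references out of such a line. It is a simplified reading of the OBO syntax that assumes a plain comma-separated list with no quoted descriptions or escaped commas, and `parse_def` is a hypothetical helper rather than an OAK function.

```python
import re

# Simplified pattern for: def: "text" [XREF1, XREF2]
# Assumption: references carry no quoted descriptions or escaped commas.
DEF_LINE = re.compile(
    r'^def:\s*"(?P<text>(?:[^"\\]|\\.)*)"\s*\[(?P<xrefs>[^\]]*)\]'
)


def parse_def(line):
    """Return (definition_text, list_of_references) for an OBO def: line."""
    match = DEF_LINE.match(line)
    if match is None:
        raise ValueError("Not a def: line in the expected form")
    xrefs = [x.strip() for x in match.group("xrefs").split(",") if x.strip()]
    return match.group("text"), xrefs


line = (
    'def: "Catalysis of the reaction: (2E,6E)-farnesyl diphosphate + '
    "4 isopentenyl diphosphate = 4 diphosphate + all-trans-heptaprenyl "
    'diphosphate." [PMID:9708911, RHEA:27794]'
)
text, xrefs = parse_def(line)
print(xrefs)  # ['PMID:9708911', 'RHEA:27794']
```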

In [13]:
!runoak --stacktrace -i llm:{claude-3-opus}:simpleobo:input/validate-defs-test.obo validate-definitions -C input/validate-definition-conf.yaml GO:0000010 -O yaml -o output/validate-definitions.llm.yaml

In [14]:
import yaml
# Use a context manager so the report file is closed after loading.
with open("output/validate-definitions.llm.yaml") as f:
    report = yaml.safe_load(f)

In [15]:
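for k, v in report.items():
    # Put long string values on their own indented lines for readability.
    # Non-string values have no .split(), so they are printed as-is.
    if isinstance(v, str) and len(v) > 50:
        lines = [f"  {line}" for line in v.split("\n")]
        v = "\n".join([""] + lines + [""])
    print(f"{k}: {v}")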

        

type: https://w3id.org/oak/ontology-metadata/DCC.S20
subject: GO:0000010
severity: INFO
predicate: IAO:0000115
object_str: 
  id: PMID:9708911
  title: Biological significance of the side chain length of ubiquinone in Saccharomyces
    cerevisiae.
  abstract: Ubiquinone (UQ), an important component of the electron transfer system,
    is constituted of a quinone structure and a side chain isoprenoid. The side chain
    length of UQ differs between microorganisms, and this difference has been used for
    taxonomic study. In this study, we have addressed the importance of the length of
    the side chain of UQ for cells, and examined the effect of chain length by producing
    UQs with isoprenoid chain lengths between 5 and 10 in Saccharomyces cerevisiae.
    To make the different UQ species, different types of prenyl diphosphate synthases
    were expressed in a S. cerevisiae COQ1 mutant defective for hexaprenyl diphosphate
    synthesis. As a result, we found that the original species

__COMMENTARY__

Note that because this is an LLM, the output may differ on every run!

In some cases the LLM fails to see that the paper is indeed about trans-hexaprenyltranstransferase activity. Even then the output is useful, as it shows us that the abstract is not directly about this activity.