# OAK statistics command

This notebook is intended as a supplement to the [main OAK CLI docs](https://incatools.github.io/ontology-access-kit/cli.html).

This notebook provides examples for the `statistics` command, which calculates basic descriptive statistics
for an ontology.

## Help Option

You can get help on any OAK command using `--help`.

In [1]:
!runoak statistics --help

Usage: runoak statistics [OPTIONS] [BRANCHES]...

  Shows all descriptive/summary statistics

  Example: -------     runoak -i sqlite:obo:pr statistics

  By default, this will show combined summary statistics for all terms

  You can also break down the statistics in two ways:

  - by a collection of branch roots

  - by a metadata property (e.g. oio:hasOBONamespace, rdfs:isDefinedBy)

  - by prefix (e.g. GO, PR, CL, OBI)

  Example: -------     runoak -i sqlite:obo:pr statistics -p
  oio:hasOBONamespace

  Note: the oio:hasOBONamespace is *not* the same as the ID prefix, it is a
  field that is used by a subset of ontologies to partition classes into broad
  groupings, similar to subsets. Its use is non-standard, yet a lot of
  ontologies use this as the main partitioning mechanism.

  A note on bundled ontologies:

  The standard release many OBO ontologies "bundles" parts of other ontologies
  (formally, the release product includes a merged imports closu

## Set up an alias

For convenience, we will set up an alias for use in this notebook:

In [18]:
alias chebi runoak -i sqlite:obo:chebi

## Calculating summary statistics (default YAML output)

We can calculate the summary stats using the `statistics` command. The output is quite lengthy,
so we will use `--output` (`-o`) to direct it to a YAML file:

In [19]:
chebi statistics -o output/chebi.stats.yaml



__Note__: CHEBI has a lot of bad xrefs, hence the output this command prints.

## Exploring the output

Let's look at the top of the YAML file:

In [20]:
!head -50 output/chebi.stats.yaml

id: AllOntologies
ontologies:
- id: obo:chebi.owl
  version: obo:chebi/231/chebi.owl
was_generated_by:
  started_at_time: '2024-03-26T17:29:56.627143'
  was_associated_with: OAK
  acted_on_behalf_of: cjm
class_count: 217549
deprecated_class_count: 18650
non_deprecated_class_count: 198899
class_count_with_text_definitions: 53575
class_count_without_text_definitions: 163974
object_property_count: 10
annotation_property_count: 37
named_individual_count: 0
subset_count: 3
rdf_triple_count: 6860047
subclass_of_axiom_count: 368285
equivalent_classes_axiom_count: 0
edge_count_by_predicate:
  BFO:0000051:
    facet: BFO:0000051
    filtered_count: 4029
  RO:0000087:
    facet: RO:0000087
    filtered_count: 43636
  obo:chebi#has_functional_parent:
    facet: obo:chebi#has_functional_parent
    filtered_count: 19632
  obo:chebi#has_parent_hydride:
    facet: obo:chebi#has_parent_hydride
    filtered_count: 1799
  obo:chebi#is_conjugate_acid_of:
    facet: obo:c

Like all objects produced by OAK, the statistics follow a data dictionary / data model. The one for
ontology stats is [https://w3id.org/oak/summary-statistics](https://w3id.org/oak/summary-statistics);
you can use this link to browse the documentation.

**A well-defined data dictionary is necessary for communicating aggregate statistics accurately**.
Often, when ontology sizes are reported informally, it's ambiguous whether *number of terms* means:

- number of *classes*, *classes plus relationship types*, or *classes plus some other elements*
- including or excluding deprecated (obsolete) entities

The OAK summary statistics data dictionary aims to provide a **standard for ontology reporting**.
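
To make the ambiguity concrete, here is a minimal sketch that derives three different "term counts" from the CHEBI figures shown above. The field names come from the YAML output; the labels in `term_counts` are illustrative only and are not part of the OAK data model.

```python
# Counts copied from the CHEBI YAML output shown above.
stats = {
    "class_count": 217549,
    "deprecated_class_count": 18650,
    "non_deprecated_class_count": 198899,
    "class_count_with_text_definitions": 53575,
    "class_count_without_text_definitions": 163974,
    "object_property_count": 10,
}

# Three plausible readings of "number of terms" (illustrative labels).
term_counts = {
    "all classes": stats["class_count"],
    "non-deprecated classes": stats["non_deprecated_class_count"],
    "classes plus relationship types": (
        stats["class_count"] + stats["object_property_count"]
    ),
}

# The named fields partition the class count, so these sums must agree.
assert (
    stats["deprecated_class_count"] + stats["non_deprecated_class_count"]
    == stats["class_count"]
)
assert (
    stats["class_count_with_text_definitions"]
    + stats["class_count_without_text_definitions"]
    == stats["class_count"]
)

for label, count in term_counts.items():
    print(f"{label}: {count}")
```

Reporting the field name alongside the number removes the guesswork about which of these was meant.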

YAML allows for nesting which is a natural way to group things; for example:

```yaml
edge_count_by_predicate:
  BFO:0000051:
    facet: BFO:0000051
    filtered_count: 4003
  RO:0000087:
    facet: RO:0000087
    filtered_count: 43082
```

This says that there are 4003 has-part (BFO:0000051) and 43082 has-role (RO:0000087) [relationships](https://incatools.github.io/ontology-access-kit/glossary.html#term-Relationship).

See the [OAK guide to relationships](https://incatools.github.io/ontology-access-kit/guide/relationships-and-graphs.html)
to understand more.
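
The nested layout is easy to work with once parsed. The sketch below starts from a plain Python dict that mirrors the YAML fragment above (what a YAML loader would return for it) and collapses one faceted block into a simple mapping. The helper name `facet_counts` is hypothetical and not part of OAK.

```python
# Parsed form of the YAML fragment above (a YAML loader would return this).
stats = {
    "edge_count_by_predicate": {
        "BFO:0000051": {"facet": "BFO:0000051", "filtered_count": 4003},
        "RO:0000087": {"facet": "RO:0000087", "filtered_count": 43082},
    }
}


def facet_counts(stats, key):
    """Return {facet: filtered_count} for one faceted block, largest first."""
    block = stats.get(key, {})
    pairs = ((facet, entry["filtered_count"]) for facet, entry in block.items())
    return dict(sorted(pairs, key=lambda kv: kv[1], reverse=True))


edge_counts = facet_counts(stats, "edge_count_by_predicate")
print(edge_counts)
```

The same helper should work for any block that uses the `facet` / `filtered_count` shape, such as the mapping counts further on in the file.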

## Mapping Stats

Further on in the YAML we can see mapping stats. See [https://w3id.org/sssom](https://w3id.org/sssom) to
understand the OAK mapping data model.

These are broken down:

- by mapping predicate (for many ontologies this is only `oio:hasDbXref`)
- by mapping object source (i.e. the database or ontology that is mapped to)

In [21]:
!grep -A40 ^mapping_statement_count output/chebi.stats.yaml

mapping_statement_count_by_predicate:
  oio:hasDbXref:
    facet: oio:hasDbXref
    filtered_count: 345271
mapping_statement_count_by_object_source:
  BFO:
    facet: BFO
    filtered_count: 1
  RO:
    facet: RO
    filtered_count: 1
  KNApSAcK:
    facet: KNApSAcK
    filtered_count: 5185
  KEGG:
    facet: KEGG
    filtered_count: 22228
  CAS:
    facet: CAS
    filtered_count: 28938
  KEGG_COMPOUND:
    facet: KEGG_COMPOUND
    filtered_count: 19870
  Beilstein:
    facet: Beilstein
    filtered_count: 9187
  IUPAC:
    facet: IUPAC
    filtered_count: 61013
  ChemIDplus:
    facet: ChemIDplus
    filtered_count: 33383
  UniProt:
    facet: UniProt
    filtered_count: 16047
  LINCS:
    facet: LINCS
    filtered_count: 41392
  Drug_Central:
    facet: Drug_Central
    filtered_count: 3784
  DrugCentral:
    facet: DrugCentral
    filtered_count: 6202
  Wikipedia:
--
mapping_statement_count_subject_by_object_source:
  BFO:
    facet: B

As expected, CHEBI does not make use of SKOS mapping predicates, and mappings
are dominated by databases such as KEGG and CAS.
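
To see how the sources compare, the following sketch ranks the object sources visible in the excerpt above and expresses each as a share of all mapping statements. It assumes the `oio:hasDbXref` count (345271) is the total, since that is the only predicate listed. The excerpt is truncated, so the shares cover only the sources shown.

```python
# Counts copied from the truncated excerpt above.
total_mappings = 345271  # oio:hasDbXref, the only predicate listed
by_source = {
    "KNApSAcK": 5185,
    "KEGG": 22228,
    "CAS": 28938,
    "KEGG_COMPOUND": 19870,
    "Beilstein": 9187,
    "IUPAC": 61013,
    "ChemIDplus": 33383,
    "UniProt": 16047,
    "LINCS": 41392,
    "Drug_Central": 3784,
    "DrugCentral": 6202,
}

# Rank sources by count and attach their share of the assumed total.
ranked = sorted(by_source.items(), key=lambda kv: kv[1], reverse=True)
top = [(source, count, count / total_mappings) for source, count in ranked[:5]]

for source, count, share in top:
    print(f"{source:15s} {count:7d} {share:6.1%}")
```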


## TSV Output

YAML is not a very natural format for further data science or statistical processing.

For that we can use the `csv` option (which actually defaults to TSV):

In [9]:
chebi statistics -o output/chebi.stats.tsv -O csv



To illustrate this we will use pandas:

In [11]:
import pandas as pd
df = pd.read_csv("output/chebi.stats.tsv", sep="\t")
df


Unnamed: 0,id,compared_with,agents,class_count,deprecated_class_count,non_deprecated_class_count,class_count_with_text_definitions,class_count_without_text_definitions,object_property_count,annotation_property_count,...,mapping_statement_count_subject_by_object_source_CTX,mapping_statement_count_subject_by_object_source_SMID,class_count_by_subset_1_STAR,class_count_by_subset_2_STAR,class_count_by_subset_3_STAR,was_generated_by_started_at_time,was_generated_by_was_associated_with,was_generated_by_acted_on_behalf_of,ontologies_id,ontologies_version
0,AllOntologies,,,185295,18628,166667,53049,132246,10,37,...,3,307,2945,102919,60803,2024-03-26T17:07:33.778117,OAK,cjm,obo:chebi.owl,obo:chebi/226/chebi.owl


This format is useful if you have multiple ontologies (see later).
But for a single ontology it's more convenient to melt this:

In [13]:
mdf = df.melt(var_name='Property', value_name='Value')
mdf[0:40]

Unnamed: 0,Property,Value
0,id,AllOntologies
1,compared_with,
2,agents,
3,class_count,185295
4,deprecated_class_count,18628
5,non_deprecated_class_count,166667
6,class_count_with_text_definitions,53049
7,class_count_without_text_definitions,132246
8,object_property_count,10
9,annotation_property_count,37


Note that this uses a very generic way of flattening the YAML, so some columns make less sense out of context.
For example, the "agent" field belongs to a parent object that describes what "agent" generated the stats
(TODO: this should say "oaklib").
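
Column names such as `was_generated_by_started_at_time` suggest that nested keys are joined with underscores. The sketch below shows one generic way to flatten nested dicts in that style. It is an illustration of the idea, not OAK's actual implementation, and it handles only nested dicts (not lists).

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the parent key along as a prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# A small nested object using values from the TSV row shown above.
nested = {
    "id": "AllOntologies",
    "class_count": 185295,
    "was_generated_by": {
        "started_at_time": "2024-03-26T17:07:33.778117",
        "was_associated_with": "OAK",
        "acted_on_behalf_of": "cjm",
    },
}

row = flatten(nested)
print(sorted(row))
```

Flattening this way loses the grouping, which is why a column like `was_generated_by_was_associated_with` is harder to interpret than its nested YAML counterpart.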

## Multi-ontology merges

Many OBO ontologies bundle portions of other ontologies with their main release. This can
be confusing! For more details see [OWL Format Variants](https://oboacademy.github.io/obook/explanation/owl-format-variants/)
in the obook.

As an example, consider naively calculating stats for the standard release of the
Cell Ontology (CL):

In [14]:
!runoak -i sqlite:obo:cl statistics | head -20

id: AllOntologies
ontologies:
- id: obo:cl.owl
  version: obo:cl/releases/2023-09-21/cl.owl
  version_info: '2023-09-21'
was_generated_by:
  started_at_time: '2024-03-26T17:16:02.669245'
  was_associated_with: OAK
  acted_on_behalf_of: cjm
class_count: 28330
deprecated_class_count: 261
non_deprecated_class_count: 28069
class_count_with_text_definitions: 15110
class_count_without_text_definitions: 13220
object_property_count: 297
annotation_property_count: 241
named_individual_count: 18
subset_count: 63
rdf_triple_count: 681623
subclass_of_axiom_count: 44142


Looking at this you might think CL has 28k classes. In fact, this is the total number of
classes in the ontology as defined by OWL, where here "ontology" means the merged
product that includes bits of GO, Uberon, etc. Confusing, huh?

Ideally the OBO Foundry would move towards making *base files* the default, but in the absence of this,
we have a few options:

* Filtering by prefix (using `-P`)
* Grouping using some property such as the prefix.

We'll try the latter:


In [22]:
!runoak -i sqlite:obo:cl statistics --group-by-prefix -o output/cl.stats.grouped.tsv -O csv

In [23]:
df = pd.read_csv("output/cl.stats.grouped.tsv", sep="\t")
df

Unnamed: 0,id,compared_with,agents,class_count,deprecated_class_count,non_deprecated_class_count,class_count_with_text_definitions,class_count_without_text_definitions,object_property_count,annotation_property_count,...,class_count_by_subset_non_informative,class_count_by_subset_organ_slim,class_count_by_subset_pheno_slim,class_count_by_subset_phenotype_rcn,class_count_by_subset_uberon_slim,class_count_by_subset_unverified_taxonomic_grouping,class_count_by_subset_upper_level,class_count_by_subset_vertebrate_core,mapping_statement_count_by_object_source_GOREL,mapping_statement_count_subject_by_object_source_GOREL
0,<http,,,0,0,0,0,0,0,1,...,,,,,,,,,,
1,<https,,,0,0,0,0,0,0,1,...,,,,,,,,,,
2,BFO,,,15,0,15,9,6,6,0,...,,,,,,,,,,
3,BSPO,,,0,0,0,0,0,24,0,...,,,,,,,,,,
4,CARO,,,20,0,20,20,0,0,0,...,,,,,,,,,,
5,CHEBI,,,123,0,123,18,105,0,0,...,,,,,,,,,,
6,CL,,,2969,249,2720,2555,414,3,0,...,,,,,,,,,,
7,GO,,,7265,2,7263,7264,1,0,0,...,,,,,,,,,,
8,IAO,,,6,0,6,4,2,0,23,...,,,,,,,,,,
9,NCBITaxon,,,138,0,138,0,138,0,0,...,,,,,,,,,,


Here we can see the numbers broken down by ontology. The number of classes in the CL row is now accurate.
Note, of course, that the other numbers don't reflect totals for each external ontology as a whole; they
only count what has been merged into CL.
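
As a rough check on how much of the merged product is native to CL, the sketch below parses a few rows copied from the grouped table and compares the CL row with the ungrouped total from earlier (28330). It uses only the standard library and inline data so that it runs without the files generated above; with the real file you would read `output/cl.stats.grouped.tsv` instead.

```python
import csv
import io

# A few rows and columns copied from the grouped table above (tab-separated).
tsv = (
    "id\tclass_count\tdeprecated_class_count\tnon_deprecated_class_count\n"
    "BFO\t15\t0\t15\n"
    "CHEBI\t123\t0\t123\n"
    "CL\t2969\t249\t2720\n"
    "GO\t7265\t2\t7263\n"
)
rows = {r["id"]: r for r in csv.DictReader(io.StringIO(tsv), delimiter="\t")}

merged_total = 28330  # class_count from the ungrouped CL run above
cl_classes = int(rows["CL"]["class_count"])
cl_share = cl_classes / merged_total

print(f"CL-prefixed classes: {cl_classes} ({cl_share:.1%} of the merged product)")
```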


## Diff stats

You can also use `--compare-with` to compare stats with a different release of an ontology. Note that this
is effectively the same as running `diff` with `--statistics`. See the diff docs for details.
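
The two CHEBI runs in this notebook happen to come from different releases (226 for the TSV run, 231 for the YAML run), so a by-hand comparison is possible. The sketch below simply subtracts matching fields. It is meant only to illustrate the idea of comparing releases and does not reproduce the output format of `--compare-with`.

```python
# Counts copied from the two CHEBI runs shown earlier in this notebook.
release_226 = {
    "class_count": 185295,
    "deprecated_class_count": 18628,
    "non_deprecated_class_count": 166667,
    "class_count_with_text_definitions": 53049,
    "class_count_without_text_definitions": 132246,
}
release_231 = {
    "class_count": 217549,
    "deprecated_class_count": 18650,
    "non_deprecated_class_count": 198899,
    "class_count_with_text_definitions": 53575,
    "class_count_without_text_definitions": 163974,
}

# Per-field change from the older release to the newer one.
delta = {key: release_231[key] - release_226[key] for key in release_226}

for key, change in delta.items():
    print(f"{key}: {change:+d}")
```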