# OAK diff-associations command

This notebook is intended as a supplement to the [main OAK CLI docs](https://incatools.github.io/ontology-access-kit/cli.html).

This notebook provides examples for the `diff-associations` command, which compares two sources of [associations](https://incatools.github.io/ontology-access-kit/glossary.html#term-Association).

For more on associations, see [Associations and Curated Annotations](https://incatools.github.io/ontology-access-kit/guide/associations.html) in the OAK guide.

For more on command line usage in general, see the [Command Line Tutorial](https://doi.org/10.5281/zenodo.7708963).

## Help Option

You can get help on any OAK command using `--help`:

In [1]:
!runoak diff-associations --help

Usage: runoak diff-associations [OPTIONS]

  Diffs two association sources.

  Example:

      runoak -i sqlite:obo:go  -G gaf  diff-associations            --old-date
      ${date1} --new-date ${date2}            -g
      "${download_dir}/${group}-${date1}.gaf"            -X
      "${download_dir}/${group}-${date2}.gaf"            --group-by
      publications -p i,p            -o
      "${group}-diff-${date1}-to-${date2}.tsv

  See https://w3id.org/oak/association for the diff data model.

  NOTE: This functionality may move out of core

Options:
  -o, --output FILENAME          Output file, e.g. obo file
  -p, --predicates TEXT          A comma-separated list of predicates. This
                                 may be a shorthand (i, p) or CURIE
  --autolabel / --no-autolabel   If set, results will automatically have
                                 labels assigned  [default: autolabel]
  -O, --output-type TEXT         Desired output type
  -o, --output FILEN

### Set up an alias

We will set up an alias for running OAK bound to GO for the purposes of this notebook:

In [2]:
alias go runoak -i sqlite:obo:go

In [3]:
go ontology-metadata --all

id:
- obo:go/extensions/go-plus.owl
IAO:0000700:
- GO:0008150
- GO:0005575
- GO:0003674
dce:description:
- The Gene Ontology (GO) provides a framework and set of concepts for describing the
  functions of gene products from all organisms.
dce:title:
- Gene Ontology
dcterms:license:
- cc:by/4.0/
oio:default-namespace:
- gene_ontology
oio:hasOBOFormatVersion:
- '1.2'
owl:versionIRI:
- obo:go/releases/2024-11-03/extensions/go-plus.owl
owl:versionInfo:
- '2024-11-03'
rdf:type:
- owl:Ontology
sh:prefix:
- obo
schema:url:
- http://purl.obolibrary.org/obo/go/extensions/go-plus.owl
rdfs:isDefinedBy:
- http://purl.obolibrary.org/obo/obo.owl


Check that queries work:

In [4]:
go info "kinase activity"

GO:0016301 ! kinase activity


## Query for associations to a gene

Here we download the SGD GAF file and then query it for all associations to a single gene.

In [29]:
!curl -L -s http://current.geneontology.org/annotations/sgd.gaf.gz | gzip -dc > input/gene_association.sgd.gaf

In [36]:
go -g input/gene_association.sgd.gaf -G gaf associations -Q subject SGD:S000002305 -O csv

subject	predicate	object	property_values	subject_label	predicate_label	object_label	negated	publications	evidence_type	supporting_objects	primary_knowledge_source	aggregator_knowledge_source	subject_closure	subject_closure_label	object_closure	object_closure_label	comments
SGD:S000002305	involved_in	GO:0006897		LDB17	None	endocytosis	None	PMID:19506040	None		infores:SGD	None					
SGD:S000002305	involved_in	GO:0006897		LDB17	None	endocytosis	None	GO_REF:0000033	None		infores:GO_Central	None					
SGD:S000002305	located_in	GO:0005935		LDB17	None	cellular bud neck	None	PMID:14562095	None		infores:SGD	None					
SGD:S000002305	located_in	GO:0005935		LDB17	None	cellular bud neck	None	PMID:26928762	None		infores:SGD	None					
SGD:S000002305	located_in	GO:0005935		LDB17	None	cellular bud neck	None	GO_REF:0000044	None		infores:UniProt	None					
SGD:S000002305	enables	GO:0003674		LDB17	None	molecular_function	None	GO_REF:0000015	None		infores:SGD	None					
SGD:S000002305	located_in	GO:0030

In [31]:
!egrep -v '\tIBA\t' input/gene_association.sgd.gaf > input/gene_association.no-IBA.sgd.gaf 
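
The `egrep` filter above removes every line that contains `IBA` between two tabs, giving a copy of the GAF without IBA annotations to diff against the original. A minimal Python sketch of the same idea follows, assuming the standard GAF layout in which the evidence code is the seventh tab-separated column. The function name `drop_evidence` and the abbreviated sample rows are hypothetical and only serve as illustration.

```python
from typing import Iterable, Iterator

# Zero-based index of the evidence code; assumes it is GAF column 7.
EVIDENCE_COLUMN = 6


def drop_evidence(lines: Iterable[str], code: str = "IBA") -> Iterator[str]:
    """Yield GAF lines whose evidence code differs from `code`.

    Header and comment lines (starting with '!') are passed through unchanged.
    """
    for line in lines:
        if line.startswith("!"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > EVIDENCE_COLUMN and cols[EVIDENCE_COLUMN] == code:
            continue  # skip annotations with the unwanted evidence code
        yield line


# Hypothetical, abbreviated GAF rows (real rows have more columns).
sample = [
    "!gaf-version: 2.2\n",
    "SGD\tS000000001\tGENE1\tenables\tGO:0016301\tPMID:1\tIDA\t\tF\n",
    "SGD\tS000000001\tGENE1\tenables\tGO:0016301\tGO_REF:0000033\tIBA\t\tF\n",
]
kept = list(drop_evidence(sample))
print(len(kept))
```

Matching on the column rather than on the whole line avoids dropping a row that happens to contain `IBA` in some other field.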

In [32]:
go -G gaf diff-associations -p i,p -g input/gene_association.no-IBA.sgd.gaf -X input/gene_association.sgd.gaf -o output/sgd-iba-diff.tsv 

In [33]:
import pandas as pd
df = pd.read_csv("output/sgd-iba-diff.tsv", sep="\t")
df

Unnamed: 0,publications,subject,new_object,is_creation,closure_predicates,closure_delta,new_object_label
0,,SGD:S000004158,GO:1990050,True,rdfs:subClassOf|BFO:0000050,6,phosphatidic acid transfer activity
1,,SGD:S000002516,GO:0005737,True,rdfs:subClassOf|BFO:0000050,11,cytoplasm
2,,SGD:S000003423,GO:0015171,True,rdfs:subClassOf|BFO:0000050,0,amino acid transmembrane transporter activity
3,,SGD:S000003423,GO:0003333,True,rdfs:subClassOf|BFO:0000050,0,amino acid transmembrane transport
4,,SGD:S000000778,GO:0005737,True,rdfs:subClassOf|BFO:0000050,0,cytoplasm
...,...,...,...,...,...,...,...
5046,,SGD:S000000427,GO:0003697,True,rdfs:subClassOf|BFO:0000050,5,single-stranded DNA binding
5047,,SGD:S000000427,GO:0003690,True,rdfs:subClassOf|BFO:0000050,5,double-stranded DNA binding
5048,,SGD:S000005988,GO:0006044,True,rdfs:subClassOf|BFO:0000050,6,N-acetylglucosamine metabolic process
5049,,SGD:S000006131,GO:0043022,True,rdfs:subClassOf|BFO:0000050,3,ribosome binding


In [34]:
df[(~df["new_object_label"].isnull())]

Unnamed: 0,publications,subject,new_object,is_creation,closure_predicates,closure_delta,new_object_label
0,,SGD:S000004158,GO:1990050,True,rdfs:subClassOf|BFO:0000050,6,phosphatidic acid transfer activity
1,,SGD:S000002516,GO:0005737,True,rdfs:subClassOf|BFO:0000050,11,cytoplasm
2,,SGD:S000003423,GO:0015171,True,rdfs:subClassOf|BFO:0000050,0,amino acid transmembrane transporter activity
3,,SGD:S000003423,GO:0003333,True,rdfs:subClassOf|BFO:0000050,0,amino acid transmembrane transport
4,,SGD:S000000778,GO:0005737,True,rdfs:subClassOf|BFO:0000050,0,cytoplasm
...,...,...,...,...,...,...,...
5046,,SGD:S000000427,GO:0003697,True,rdfs:subClassOf|BFO:0000050,5,single-stranded DNA binding
5047,,SGD:S000000427,GO:0003690,True,rdfs:subClassOf|BFO:0000050,5,double-stranded DNA binding
5048,,SGD:S000005988,GO:0006044,True,rdfs:subClassOf|BFO:0000050,6,N-acetylglucosamine metabolic process
5049,,SGD:S000006131,GO:0043022,True,rdfs:subClassOf|BFO:0000050,3,ribosome binding
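
With several thousand rows, a quick summary of the diff table can help. The sketch below uses only the Python standard library and a handful of rows copied from the output above (restricted to a subset of the columns), so it runs stand-alone. The helper name `summarize_diff` is hypothetical, and the choice to single out rows with a non-zero `closure_delta` is an assumption about what is worth a closer look, not something the diff data model prescribes.

```python
import csv
import io
from collections import Counter

# Rows copied from the diff output above (subset of columns), kept as TSV text
# so the sketch is self-contained; in practice, open output/sgd-iba-diff.tsv.
SAMPLE_TSV = (
    "subject\tnew_object\tis_creation\tclosure_delta\tnew_object_label\n"
    "SGD:S000004158\tGO:1990050\tTrue\t6\tphosphatidic acid transfer activity\n"
    "SGD:S000002516\tGO:0005737\tTrue\t11\tcytoplasm\n"
    "SGD:S000003423\tGO:0015171\tTrue\t0\tamino acid transmembrane transporter activity\n"
    "SGD:S000003423\tGO:0003333\tTrue\t0\tamino acid transmembrane transport\n"
    "SGD:S000000778\tGO:0005737\tTrue\t0\tcytoplasm\n"
)


def summarize_diff(handle):
    """Return (rows with closure_delta > 0, Counter of new_object_label)."""
    reader = csv.DictReader(handle, delimiter="\t")
    nonzero = []
    label_counts = Counter()
    for row in reader:
        label_counts[row["new_object_label"]] += 1
        if int(row["closure_delta"]) > 0:
            nonzero.append(row)
    return nonzero, label_counts


nonzero, label_counts = summarize_diff(io.StringIO(SAMPLE_TSV))
print([row["subject"] for row in nonzero])
print(label_counts.most_common(1))
```

The same grouping could be done on the `df` loaded earlier with pandas; the standard-library version is shown only to keep the example free of dependencies.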


## Query for associations to a term

In contrast to gene queries, term queries should make use of [ontology relationships](https://incatools.github.io/ontology-access-kit/guide/relationships-and-graphs.html); in particular, we typically want to include all is-a and part-of descendants in the query.

In [6]:
go -g input/gene_association.sgd.gaf -G gaf associations -p i,p "kinase activity" -O csv | head -20

Negated association: SGD	S000004301	CDC25	NOT	GO:0005886	SGD_REF:S000128073|PMID:18930081	IDA		C	Membrane bound guanine nucleotide exchange factor	YLR310C|CTN1|CDC25'|Ras family guanine nucleotide exchange factor CDC25	gene	taxon:559292	20090401	SGD
Negated association: SGD	S000002266	KIN28	NOT	GO:0019912	SGD_REF:S000046450|PMID:7760796	IDA		F	Serine/threonine protein kinase, subunit of transcription factor TFIIH	YDL108W|TFIIH complex serine/threonine-protein kinase subunit KIN28	gene	taxon:559292	20051102	SGD
Negated association: SGD	S000002266	KIN28	NOT	GO:0019912	SGD_REF:S000046450|PMID:7760796	IMP		F	Serine/threonine protein kinase, subunit of transcription factor TFIIH	YDL108W|TFIIH complex serine/threonine-protein kinase subunit KIN28	gene	taxon:559292	20051102	SGD
Negated association: SGD	S000001460	MRS1	NOT	GO:0004520	SGD_REF:S000068996|PMID:11773622	ISS	UniProtKB:Q03702	F	Splicing protein	YIR021W|PET157	gene	taxon:559292	20030821	SGD
Negated association: SGD	S000001460	MRS

Note that including part-of (`p`) makes no difference in the molecular function (MF) hierarchy of GO, but it makes
a big difference in the other two (biological process and cellular component).

### Important: closures make a big difference

Let's compare the number of results with and without closures:

In [10]:
go -g input/gene_association.sgd.gaf -G gaf associations -p i,p "kinase activity" -O csv | wc

    3209   32091  315394


In [11]:
go -g input/gene_association.sgd.gaf -G gaf associations "kinase activity" -O csv | wc

     285    2851   26750
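
The gap between the two counts (3209 lines versus 285) comes from the closure: with `-p i,p`, an annotation to any descendant of the query term also counts as a match. The toy sketch below illustrates that mechanism with a three-term is-a chain. The edges are a simplified fragment for illustration rather than the full GO graph, and the gene identifiers and the `query` helper are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy is-a edges (child -> parents); an illustrative fragment, not the full GO graph.
IS_A: Dict[str, List[str]] = {
    "GO:0004674": ["GO:0004672"],  # protein serine/threonine kinase activity
    "GO:0004672": ["GO:0016301"],  # protein kinase activity
    "GO:0016301": [],              # kinase activity
}

# Hypothetical gene-to-term annotations.
ANNOTATIONS: List[Tuple[str, str]] = [
    ("GENE:A", "GO:0016301"),
    ("GENE:B", "GO:0004672"),
    ("GENE:C", "GO:0004674"),
]


def ancestors(term: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Return the reflexive ancestor closure of `term` over `edges`."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(edges.get(current, []))
    return seen


def query(term: str, use_closure: bool) -> List[Tuple[str, str]]:
    """Return annotations matching `term` directly, or via the closure."""
    if use_closure:
        return [(g, t) for g, t in ANNOTATIONS if term in ancestors(t, IS_A)]
    return [(g, t) for g, t in ANNOTATIONS if t == term]


direct = query("GO:0016301", use_closure=False)
closed = query("GO:0016301", use_closure=True)
print(len(direct), len(closed))
```

Only the gene annotated directly to kinase activity matches without the closure, while all three match once descendants are taken into account.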


## Complex Queries

We can use the OAK graph query language to specify exhaustive lists of direct terms.

For example, to retrieve annotations to any kinase activity that is not a protein kinase activity:

In [12]:
go -g input/gene_association.sgd.gaf -G gaf associations  .desc//p=i "kinase activity" .not .desc//p=i "protein kinase activity" -O csv | head -30

subject	predicate	object	object_label	property_values	subject_label	predicate_label	negated	publications	primary_knowledge_source	aggregator_knowledge_source
SGD:S000001369	None	GO:0016301	None		PFK26	None	None	SGD_REF:S000148669	infores:UniProt	None
SGD:S000001369	None	GO:0003873	None		PFK26	None	None	SGD_REF:S000124037	infores:UniProt	None
SGD:S000001369	None	GO:0003873	None		PFK26	None	None	SGD_REF:S000124036	infores:InterPro	None
SGD:S000001369	None	GO:0003873	None		PFK26	None	None	SGD_REF:S000051318|PMID:1322693	infores:SGD	None
SGD:S000001369	None	GO:0003873	None		PFK26	None	None	SGD_REF:S000048479|PMID:1657152	infores:SGD	None
SGD:S000002318	None	GO:0016301	None		STE7	None	None	SGD_REF:S000148669	infores:UniProt	None
SGD:S000000605	None	GO:0004618	None		PGK1	None	None	SGD_REF:S000058483|PMID:6254992	infores:SGD	None
SGD:S000000605	None	GO:0004618	None		PGK1	None	None	SGD_REF:S000124036	infores:InterPro	None
SGD:S000000605	None	GO:0004618	None		PGK1	None	None	SGD_REF:S000124036	i

SGD:S000001357	None	None	GO:0016301	kinase activity	[]
SGD:S000001304	None	None	GO:0016301	kinase activity	[]
SGD:S000006258	None	None	GO:0016301	kinase activity	[]
SGD:S000006135	None	None	GO:0016301	kinase activity	[]
SGD:S000003437	None	None	GO:0016301	kinase activity	[]
SGD:S000003636	None	None	GO:0016301	kinase activity	[]
SGD:S000001861	None	None	GO:0016301	kinase activity	[]
SGD:S000003593	None	None	GO:0016301	kinase activity	[]
SGD:S000003820	None	None	GO:0016301	kinase activity	[]
SGD:S000003866	None	None	GO:0016301	kinase activity	[]
SGD:S000002237	None	None	GO:0016301	kinase activity	[]
SGD:S000003027	None	None	GO:0016301	kinase activity	[]
SGD:S000003494	None	None	GO:0016301	kinase activity	[]
SGD:S000005310	None	None	GO:0016301	kinase activity	[]
SGD:S000005330	None	None	GO:0016301	kinase activity	[]
SGD:S000005200	None	None	GO:0016301	kinase activity	[]
SGD:S000005105	None	None	GO:0016301	kinase activity	[]
SGD:S000004965	None	None	GO:0016301	kinase activity	[]
SGD:S00000
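
Conceptually, the query above is a set difference: the is-a descendants of kinase activity minus the is-a descendants of protein kinase activity. The sketch below shows that operation on a flattened toy hierarchy. The term IDs appear in the outputs in this notebook, but the parent-child edges are simplified assumptions (the real GO graph has intermediate terms), and treating `.desc` as including the starting term is also an assumption, suggested by the direct `GO:0016301` annotations in the output.

```python
from typing import Dict, List, Set

# Simplified parent -> children is-a edges; assumed for illustration only.
CHILDREN: Dict[str, List[str]] = {
    "GO:0016301": ["GO:0004672", "GO:0003873", "GO:0004618"],  # kinase activity
    "GO:0004672": ["GO:0004674"],                              # protein kinase activity
}


def descendants(term: str, children: Dict[str, List[str]]) -> Set[str]:
    """Return `term` together with all of its descendants (reflexive closure)."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children.get(current, []))
    return seen


# Kinase activities that are not protein kinase activities.
selected = descendants("GO:0016301", CHILDREN) - descendants("GO:0004672", CHILDREN)
print(sorted(selected))
```

The resulting term list is what gets matched directly against the annotations, which is why no `-p` option is needed on the `associations` command in this case.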

## Querying via API

Some association sources provide an API, so rather than downloading an association file, you can have OAK query the API directly.

Note that API endpoints may not support all OAK options; e.g. the amigo endpoint currently forces you to use IDs:

In [13]:
!runoak -i amigo:NCBITaxon:9606 associations -p i,p GO:0016301 | head -30

subject	predicate	object	property_values	subject_label	predicate_label	object_label	negated	publications	primary_knowledge_source	aggregator_knowledge_source
UniProtKB:Q13976	None	GO:0004672		PRKG1	None	protein kinase activity	None	PMID:25447536	BHF-UCL	infores:go
UniProtKB:Q13976	None	GO:0004692		PRKG1	None	cGMP-dependent protein kinase activity	None	PMID:21402151	UniProt	infores:go
UniProtKB:Q13976	None	GO:0004692		PRKG1	None	cGMP-dependent protein kinase activity	None	Reactome:R-HSA-418442	Reactome	infores:go
UniProtKB:Q13976	None	GO:0106310		PRKG1	None	protein serine kinase activity	None	GO_REF:0000116	RHEA	infores:go
UniProtKB:Q9HCP0	None	GO:0004674		CSNK1G1	None	protein serine/threonine kinase activity	None	PMID:25500533	ParkinsonsUK-UCL	infores:go
UniProtKB:Q9HCP0	None	GO:0106310		CSNK1G1	None	protein serine kinase activity	None	GO_REF:0000116	RHEA	infores:go
UniProtKB:Q9HCP0	None	GO:0004674		CSNK1G1	None	protein serine/threonine kinase activity	None	PMID:21873635	GO_Central	inf