# Mapping using OAK: mapping GO to MetaCyc

This notebook illustrates the use of the OAK `lexmatch` command to generate lexical mappings between GO and MetaCyc.

First we set up a convenience alias:

In [None]:
alias go runoak -i sqlite:obo:go

Next we will use `--help` to get the inline documentation on `lexmatch`.

In [3]:
go lexmatch --help

Usage: runoak lexmatch [OPTIONS] [TERMS]...

  Performs lexical matching between pairs of terms in one more more
  ontologies.

  Examples:

      runoak -i foo.obo lexmatch -o foo.sssom.tsv

  In this example, the input ontology file is assumed to contain all pairs of
  terms to be mapped.

  It is more common to map between all pairs of terms in two ontology files.
  In this case, you can merge the ontologies using a tool like ROBOT; or,  to
  avoid a merge preprocessing step, use the --addl (-a) option to specify a
  second ontology file.

      runoak -i foo.obo --add bar.obo lexmatch -o foo.sssom.tsv

  By default, this command will compare all terms in all ontologies. You can
  use the OAK term query syntax to pass in the set of all terms to be
  compared.

  For example, to compare all terms in union of FOO and BAR namespaces:

      runoak -i foo.obo --add bar.obo lexmatch -o foo.sssom.tsv i^FOO: i^BAR:

  All members of the set are compared (includin

## Using lexmatch

Here we assume a yeastpathways file has been prepared in an ontology format (`inputs/yp.obo`).

To align it with GO, we pass this file with the `-a` option, which loads additional ontologies.

To restrict the mappings to those between GO and MetaCyc, we pass two term lists, separated by `@`, each of which is an OAK selector query:

- `i^GO:` all terms with an ID that starts with `GO:`
- `i^MetaCyc:` all terms with an ID that starts with `MetaCyc:`
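
As a rough illustration of what this restriction means, the sketch below keeps a pair only when one side has a `GO:` ID and the other a `MetaCyc:` ID. The helper names `has_prefix` and `is_between` are hypothetical, treating the pair as unordered is an assumption of this sketch, and none of this reflects how OAK evaluates selector queries internally.

```python
def has_prefix(curie, prefix):
    """True if the ID starts with the given prefix, e.g. 'GO:'."""
    return curie.startswith(prefix)


def is_between(subject_id, object_id, left="GO:", right="MetaCyc:"):
    """True if one side of the pair is in the left set and the other in the right set."""
    return (has_prefix(subject_id, left) and has_prefix(object_id, right)) or (
        has_prefix(subject_id, right) and has_prefix(object_id, left)
    )


# A GO-to-MetaCyc pair is kept; a GO-to-GO pair is not.
print(is_between("GO:0005978", "MetaCyc:PWY3O-4031"))
print(is_between("GO:0005978", "GO:0005980"))
```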

We will use SSSOM as the output format (the default). Note that SSSOM requires every ID to be expandable to a URI, and `MetaCyc` is currently not in the default OBO prefix list, so we declare its prefix on the command line:

In [8]:
go --prefix MetaCyc=http://identifiers.org/MetaCyc/ -a inputs/yp.obo lexmatch i^GO: @ i^MetaCyc: -o output/go-to-yp.sssom.tsv
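
To make the role of `--prefix` concrete, here is a minimal sketch of CURIE-to-URI expansion with a prefix map. The `MetaCyc` entry is the one passed on the command line above and the `GO` entry is the standard OBO PURL base; `PREFIX_MAP` and `expand_curie` are hypothetical names for illustration, not part of OAK or SSSOM tooling.

```python
# Prefix map: CURIE prefix -> URI base.
PREFIX_MAP = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "MetaCyc": "http://identifiers.org/MetaCyc/",  # supplied via --prefix above
}


def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Expand a CURIE such as 'GO:0005978' to a full URI."""
    prefix, local_id = curie.split(":", 1)
    if prefix not in prefix_map:
        # This is the situation the --prefix option is meant to avoid.
        raise KeyError(f"No URI expansion registered for prefix {prefix!r}")
    return prefix_map[prefix] + local_id


print(expand_curie("MetaCyc:PWY3O-4031"))
print(expand_curie("GO:0005978"))
```

Without the `MetaCyc` entry, the first call would fail, which mirrors why the prefix has to be declared before strict SSSOM output can be written.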



Next we will use pandas to explore and view the table.

In [9]:
import pandas as pd

In [11]:
df = pd.read_csv("output/go-to-yp.sssom.tsv", comment="#", sep="\t")
df

Unnamed: 0,subject_id,subject_label,predicate_id,object_id,object_label,mapping_justification,mapping_tool,confidence,subject_match_field,object_match_field,match_string
0,GO:0005978,glycogen biosynthetic process,skos:closeMatch,MetaCyc:PWY3O-4031,glycogen biosynthesis,semapv:LexicalMatching,oaklib,0.5,oio:hasExactSynonym,rdfs:label,glycogen biosynthesis
1,GO:0005980,glycogen catabolic process,skos:closeMatch,MetaCyc:GLYCOCAT-YEAST-PWY,glycogen catabolism,semapv:LexicalMatching,oaklib,0.5,oio:hasExactSynonym,rdfs:label,glycogen catabolism
2,GO:0005987,sucrose catabolic process,skos:closeMatch,MetaCyc:SUCUTIL-PWY-2,sucrose degradation,semapv:LexicalMatching,oaklib,0.5,oio:hasExactSynonym,rdfs:label,sucrose degradation
3,GO:0005993,trehalose catabolic process,skos:closeMatch,MetaCyc:TREDEG-YEAST-PWY,trehalose degradation,semapv:LexicalMatching,oaklib,0.5,oio:hasExactSynonym,rdfs:label,trehalose degradation
4,GO:0005998,xylulose catabolic process,skos:closeMatch,MetaCyc:PWY3O-5,xylulose degradation,semapv:LexicalMatching,oaklib,0.5,oio:hasExactSynonym,rdfs:label,xylulose degradation
5,GO:0006001,fructose catabolic process,skos:closeMatch,MetaCyc:PWY3O-0,fructose degradation,semapv:LexicalMatching,oaklib,0.5,oio:hasExactSynonym,rdfs:label,fructose degradation
6,GO:0006031,chitin biosynthetic process,skos:closeMatch,MetaCyc:PWY3O-15,chitin biosynthesis,semapv:LexicalMatching,oaklib,0.5,oio:hasExactSynonym,rdfs:label,chitin biosynthesis
7,GO:0006048,UDP-N-acetylglucosamine biosynthetic process,skos:closeMatch,MetaCyc:UDPNAGSYN-YEAST-PWY,UDP-N-acetylglucosamine biosynthesis,semapv:LexicalMatching,oaklib,0.5,oio:hasExactSynonym,rdfs:label,udp-n-acetylglucosamine biosynthesis
8,GO:0006068,ethanol catabolic process,skos:closeMatch,MetaCyc:PWY3O-4300,ethanol degradation,semapv:LexicalMatching,oaklib,0.5,oio:hasExactSynonym,rdfs:label,ethanol degradation
9,GO:0006097,glyoxylate cycle,skos:closeMatch,MetaCyc:GLYOXYLATE-BYPASS,glyoxylate cycle,semapv:LexicalMatching,oaklib,0.5,rdfs:label,rdfs:label,glyoxylate cycle
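
One quick way to get a feel for a mapping table like this is to count which pairs of match fields produced the mappings. The sketch below uses only the standard library so it can run outside the notebook. It reads a small inline sample built from three of the rows above (with fewer columns) rather than the real output file, and `read_sssom_rows` and `count_match_fields` are hypothetical helper names.

```python
import csv
import io
from collections import Counter

# Inline sample with a subset of the columns; the real file has more rows and columns.
SAMPLE_TSV = (
    "# metadata lines in the file start with '#'\n"
    "subject_id\tpredicate_id\tobject_id\tsubject_match_field\tobject_match_field\n"
    "GO:0005978\tskos:closeMatch\tMetaCyc:PWY3O-4031\toio:hasExactSynonym\trdfs:label\n"
    "GO:0005980\tskos:closeMatch\tMetaCyc:GLYCOCAT-YEAST-PWY\toio:hasExactSynonym\trdfs:label\n"
    "GO:0006097\tskos:closeMatch\tMetaCyc:GLYOXYLATE-BYPASS\trdfs:label\trdfs:label\n"
)


def read_sssom_rows(handle):
    """Return one dict per mapping row, skipping '#' metadata lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


def count_match_fields(rows):
    """Count mappings per (subject_match_field, object_match_field) pair."""
    return Counter((r["subject_match_field"], r["object_match_field"]) for r in rows)


rows = read_sssom_rows(io.StringIO(SAMPLE_TSV))
field_counts = count_match_fields(rows)
for fields, n in field_counts.most_common():
    print(fields, n)
```

With the dataframe loaded above, a comparable view should be available via `df.groupby(["subject_match_field", "object_match_field"]).size()`.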


## Enhancing mappings with a rule file

Our first pass yielded 51 mappings (many of which were already present as xrefs). Those familiar with GO nomenclature will recognize that this is likely an undercount, because GO uses a formal naming style that is not always used by biologists or by the curators of MetaCyc.

For example, GO uses names of the form `X catabolic process`, whereas a biologist might just say `X catabolism`. For microbes, `X degradation` is also common.

Sometimes GO includes synonyms for these, *but this can't be relied on consistently*.

We will use the *synonymizer* to auto-inject these synonyms.
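
Conceptually, this kind of synonym injection can be pictured as a set of regular-expression replacements applied to each label. The following is only a sketch of the idea: the patterns in `REPLACEMENTS` are assumptions based on the naming examples above, not the rules from the actual rule file, and `generate_synonyms` is a hypothetical helper rather than an OAK function.

```python
import re

# Hypothetical (pattern, replacement) rules; the real ones live in the rule file.
REPLACEMENTS = [
    (r" catabolic process$", " catabolism"),
    (r" catabolic process$", " degradation"),
    (r" biosynthetic process$", " biosynthesis"),
]


def generate_synonyms(label):
    """Return the alternative labels produced by every rule that matches."""
    synonyms = []
    for pattern, replacement in REPLACEMENTS:
        new_label, n_subs = re.subn(pattern, replacement, label)
        if n_subs:
            synonyms.append(new_label)
    return synonyms


print(generate_synonyms("glycogen catabolic process"))
print(generate_synonyms("glycogen biosynthetic process"))
```

The generated strings are then available for lexical matching alongside the labels and synonyms already present in the ontology.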

First we will curate a *rule file* that looks like this:


In [14]:
!cat inputs/go-match-rules.yaml

rules:
  - description: default
    postconditions:
      predicate_id: skos:closeMatch
      weight: 0.0

  - description: exact to exact
    preconditions:
      subject_match_field_one_of:
        - oio:hasExactSynonym
        - rdfs:label
        - skos:prefLabel
      object_match_field_one_of:
        - oio:hasExactSynonym
        - rdfs:label
        - skos:prefLabel
    postconditions:
      predicate_id: skos:exactMatch
      weight: 2.0

  - description: >-
     label to label; note this is additive with the exact to exact rule,
      so the score just represents an additional small boost
    preconditions:
      subject_match_field_one_of:
        - rdfs:label
      object_match_field_one_of:
        - rdfs:label
    postconditions:
      predicate_id: skos:exactMatch
      weight: 0.5

  - description: xref match
    preconditions:
      subject_match_field_one_of:
        - oio:hasDbXref
        - skos:exactMatch
      object_match_fiel

Note the `synonymizer` replacement rules at the end.
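
The rule weights are additive, as the "label to label" description says. The confidence values reported in this notebook are consistent with passing the summed weight through a base-2 logistic function: a weight of 0.0 gives the 0.5 seen in the first run, while the rerun with this rule file reports 0.8 (weight 2.0) and 0.849779 (weight 2.0 + 0.5). This relationship is inferred from the numbers shown here and is not a statement about oaklib's implementation; `confidence_from_weights` is a hypothetical name.

```python
def confidence_from_weights(weights):
    """Map a list of additive rule weights to a value in (0, 1).

    Assumption: a base-2 logistic, chosen because it reproduces the
    confidence values shown in this notebook's output tables.
    """
    total = sum(weights)
    return 1.0 / (1.0 + 2.0 ** (-total))


print(confidence_from_weights([0.0]))                 # default rule only
print(round(confidence_from_weights([2.0]), 6))       # exact to exact
print(round(confidence_from_weights([2.0, 0.5]), 6))  # exact to exact + label to label
```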

Next we will run `lexmatch` again, passing the rule file using `-R`.

In [12]:
go --prefix MetaCyc=http://identifiers.org/MetaCyc/ -a inputs/yp.obo lexmatch -R inputs/go-match-rules.yaml i^GO: @ i^MetaCyc: -o output/go-to-yp.sssom.tsv



In [15]:
df = pd.read_csv("output/go-to-yp.sssom.tsv", comment="#", sep="\t")
df

Unnamed: 0,subject_id,subject_label,predicate_id,object_id,object_label,mapping_justification,mapping_tool,confidence,subject_match_field,object_match_field,match_string,subject_preprocessing,object_preprocessing
0,GO:0005978,glycogen biosynthetic process,skos:exactMatch,MetaCyc:PWY3O-4031,glycogen biosynthesis,semapv:LexicalMatching,oaklib,0.849779,rdfs:label,rdfs:label,glycogen biosynthesis,semapv:RegularExpressionReplacement,
1,GO:0005978,glycogen biosynthetic process,skos:exactMatch,MetaCyc:PWY3O-4031,glycogen biosynthesis,semapv:LexicalMatching,oaklib,0.800000,oio:hasExactSynonym,rdfs:label,glycogen biosynthesis,,
2,GO:0005980,glycogen catabolic process,skos:exactMatch,MetaCyc:GLYCOCAT-YEAST-PWY,glycogen catabolism,semapv:LexicalMatching,oaklib,0.849779,rdfs:label,rdfs:label,glycogen catabolism,semapv:RegularExpressionReplacement,
3,GO:0005980,glycogen catabolic process,skos:exactMatch,MetaCyc:GLYCOCAT-YEAST-PWY,glycogen catabolism,semapv:LexicalMatching,oaklib,0.800000,oio:hasExactSynonym,rdfs:label,glycogen catabolism,,
4,GO:0005987,sucrose catabolic process,skos:exactMatch,MetaCyc:SUCUTIL-PWY-2,sucrose degradation,semapv:LexicalMatching,oaklib,0.800000,oio:hasExactSynonym,rdfs:label,sucrose degradation,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
78,GO:0070485,"dehydro-D-arabinono-1,4-lactone biosynthetic p...",skos:exactMatch,MetaCyc:PWY3O-6-1,"dehydro-D-arabinono-1,4-lactone biosynthesis",semapv:LexicalMatching,oaklib,0.800000,oio:hasExactSynonym,rdfs:label,"dehydro-d-arabinono-1,4-lactone biosynthesis",,
79,GO:0071269,L-homocysteine biosynthetic process,skos:exactMatch,MetaCyc:PWY-5344,L-homocysteine biosynthesis,semapv:LexicalMatching,oaklib,0.800000,oio:hasExactSynonym,rdfs:label,l-homocysteine biosynthesis,,
80,GO:0071269,L-homocysteine biosynthetic process,skos:exactMatch,MetaCyc:PWY-5344,L-homocysteine biosynthesis,semapv:LexicalMatching,oaklib,0.849779,rdfs:label,rdfs:label,l-homocysteine biosynthesis,semapv:RegularExpressionReplacement,
81,MetaCyc:PWY3O-4153,phenylalanine biosynthesis,skos:narrowMatch,GO:0009094,L-phenylalanine biosynthetic process,semapv:LexicalMatching,oaklib,0.800000,rdfs:label,oio:hasBroadSynonym,phenylalanine biosynthesis,,semapv:RegularExpressionReplacement
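
Note that the same subject and object pair can now appear more than once, because it was matched through different fields (for example rows 0 and 1). If one row per pair is wanted, a simple approach is to keep the highest-confidence row for each pair. The sketch below shows this with plain dictionaries and three rows taken from the table above; `best_per_pair` is a hypothetical helper, not part of OAK.

```python
def best_per_pair(rows):
    """Keep the highest-confidence row for each (subject_id, object_id) pair."""
    best = {}
    for row in rows:
        key = (row["subject_id"], row["object_id"])
        if key not in best or row["confidence"] > best[key]["confidence"]:
            best[key] = row
    return list(best.values())


rows = [
    {"subject_id": "GO:0005978", "object_id": "MetaCyc:PWY3O-4031", "confidence": 0.849779},
    {"subject_id": "GO:0005978", "object_id": "MetaCyc:PWY3O-4031", "confidence": 0.800000},
    {"subject_id": "GO:0005987", "object_id": "MetaCyc:SUCUTIL-PWY-2", "confidence": 0.800000},
]
deduplicated = best_per_pair(rows)
print(len(deduplicated))
```

On the dataframe itself, something like `df.sort_values("confidence", ascending=False).drop_duplicates(["subject_id", "object_id"])` should give a comparable result.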


In [None]:
This time we get 83 rows, including 

To download: [output/go-to-yp.sssom.tsv](output/go-to-yp.sssom.tsv)

## Future steps

In the future, this tutorial will also include:

- methods to retrieve unmapped terms
- ...