Merge pull request #397 from INCATools/1.2.26-fixes

1.2.26 fixes
INCATools · Feb 11, 2021 · 39ae587
2 parents 4c88f99 + 3c759e1
commit 39ae587

Showing 9 changed files with 311 additions and 23 deletions.
diff --git a/Changes.md b/Changes.md
@@ -1,3 +1,37 @@
+# v1.2.26 (10 February 2021): HOTFIXES
+- Hotfixes:
+  - The new mireot module technique was buggy and is therefore removed again. Sorry; we will try again next time. You can still use the `custom` option to implement mireot yourself!
+  - A change in the way imports were processed introduced a very high memory footprint for large ontologies and slowed things down. If you do not have a lot of memory (and time!) available, you should use the following new flags: `is_large` and `use_gzipped`. `is_large: TRUE` introduces special handling for the ontology that is faster and consumes less memory when creating an import. Using `use_gzipped` will try to download the ontology from its gzipped location. Make sure it's actually there (we know it's the case for chebi and pr at least)!
+```
+import_group:
+  products: 
+    - id: pr
+      use_gzipped: TRUE
+      is_large: TRUE
+    - id: chebi
+      use_gzipped: TRUE
+      is_large: TRUE
+```
+  - An irrelevant file (keeprelations.txt) was still generated when seeding a new repo, even if it was not needed.
+  - Module type `STAR` was accidentally hard coded default for slme. Now changed to `BOT` as it was.
+  - CI configs were not correctly copied by the update routine. Now they are. Note that for the changes to be picked up, you need to run `sh run.sh make update_repo` twice (once for updating the update script itself)!
+  - Geeky (but necessary): all phony make goals are now correctly declared as `.PHONY`.
+- Some last minute features:
+  - In new repos, the README.md is now generated with the correct, appropriate banners.
+  - We now have a new feature, `custom_makefile_header`, that allows injecting a custom header into the Makefile. Most mortals won't need this, but this is how it goes:
+```
+custom_makefile_header: |
+  ### Workflow
+  #
+  # Tasks to edit and release OMRSE.
+  #
+  # #### Edit
+  #
+  # 1. [Prepare release](prepare_release)
+  # 2. [Refresh imports](all_imports)
+  # 3. [Update repo to latest ODK](update_repo)
+```
+- all features and fixes here: https://github.com/INCATools/ontology-development-kit/pull/397
 
 # v1.2.26 (2 February 2021)
 - New versions:

diff --git a/docs/DealWithLargeOntologies.md b/docs/DealWithLargeOntologies.md
@@ -0,0 +1,161 @@
+# Dealing with huge ontologies in your import chain
+
+Dealing with very large ontologies, such as Protein Ontology (PR), NCBI Taxonomy (NCBITaxon), Gene Ontology (GO) and CHEBI, is a big challenge when developing ontologies, especially if we want to import and re-use terms from them. There are two major problems:
+1. It currently takes about 12-16 GB of memory to process PR and NCBITaxon - memory that many of us do not have available.
+2. The files are so large, pulling them over the internet can lead to failures, timeouts and other problems. 
+
+There are a few strategies we can employ to deal with the problem of memory consumption:
+1. We try to reduce the memory footprint of the import as much as possible. In other words: we try to not do the fancy stuff ODK does by default when extracting a module, and keep it simple.
+2. We manage the import manually ourselves (no import)
+
+To deal with file size, we can:
+1. Import curated subsets instead of the whole thing.
+2. Use gzipped (compressed) versions, if available.
+
+All four strategies are discussed in the following sections.
+
+## Overwrite ODK default: less fancy, custom modules
+
+The default recipe for creating a module looks something like this:
+
+```
+imports/%_import.owl: mirror/%.owl imports/%_terms_combined.txt
+	if [ $(IMP) = true ]; then $(ROBOT) query  -i $< --update ../sparql/preprocess-module.ru \
+		extract -T imports/$*_terms_combined.txt --force true --copy-ontology-annotations true --individuals exclude --method BOT \
+		query --update ../sparql/inject-subset-declaration.ru --update ../sparql/postprocess-module.ru \
+		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/%_import.owl
+```
+(Note: This snippet was copied here on 10 February 2021 and may be out of date by the time you read this.)
+
+As you can see, a lot is going on here: first we run some preprocessing (which is really costly in ROBOT, as we need to load the ontology into Jena, and then back into the OWL API, so basically the ontology is loaded three times in total), then extract a module, then run more SPARQL queries, and so on. This is costly. For small ontologies, this is fine. All of these processes are important to mitigate some of the shortcomings of module extraction techniques, but even if these costs were sorted out in ROBOT, it may still not be enough.
+
+So what we can do now is this. In your `ont.Makefile` (for example, `go.Makefile`, NOT `Makefile`), located in `src/ontology`, you can add a snippet like this:
+
+```
+imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt
+	if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \
+		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/pr_import.owl
+```
+
+Note that all the `%` variables and uses of `$*` are replaced by the ontology id in question. Adding this to your `ont.Makefile` will overwrite the default ODK behaviour in favour of this new recipe.
+
+_The ODK supports this reduced module out of the box. To activate it, do this:_
+
+```
+import_group:
+  products: 
+    - id: pr
+      use_gzipped: TRUE
+      is_large: TRUE
+```
+
+This will (a) ensure that PR is pulled from a gzipped location (you _have_ to check whether it exists though; it must correspond to the PURL followed by the extension `.gz`, for example `http://purl.obolibrary.org/obo/pr.owl.gz`) and (b) mark it as large, so the default handling of large imports is activated for `pr`, and you don't need to paste anything into `ont.Makefile`.
+
+If you prefer to do it yourself, in the following you can find a few snippets you can use that work for three large ontologies. Just copy them and drop them into `ont.Makefile`; and adjust them however you wish.
+
+### Protein Ontology (PR)
+
+```
+imports/pr_import.owl: mirror/pr.owl imports/pr_terms_combined.txt
+	if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/pr_terms_combined.txt --force true --method BOT \
+		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/pr_import.owl
+```
+
+### NCBI Taxonomy (NCBITaxon)
+
+```
+imports/ncbitaxon_import.owl: mirror/ncbitaxon.owl imports/ncbitaxon_terms_combined.txt
+	if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/ncbitaxon_terms_combined.txt --force true --method BOT \
+		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/ncbitaxon_import.owl
+```
+
+### CHEBI
+
+```
+imports/chebi_import.owl: mirror/chebi.owl imports/chebi_terms_combined.txt
+	if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/chebi_terms_combined.txt --force true --method BOT \
+		annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/chebi_import.owl
+```
+
+Feel free to use an even cheaper approach, even one that does not use ROBOT, as long as it produces the target of the goal (e.g. `imports/chebi_import.owl`).
+
+## Use slims when they are available
+
+For some ontologies, you can find slims that are _much_ smaller than the full ontology. For example, NCBITaxon maintains a slim for OBO here: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.obo, which is just 3 MB(!!) compared to the 1 or 2 GB of the full version. Many ontologies maintain such slims, and if not, probably should (I would really like to see an OBO slim for Protein Ontology!).
+
+You can also add your favourite Taxa to that slim by simply making a pull request on here: https://github.com/obophenotype/ncbitaxon/blob/master/subsets/taxon-subset-ids.txt
+
+You can use those slims simply like this:
+
+```
+import_group:
+  products: 
+    - id: ncbitaxon
+      mirror_from: http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.obo
+```
+
+## Manage imports manually
+
+This is a real hack, and we want to discourage it in the strongest terms. But sometimes, importing an ontology just to import a single term is total overkill. What we do in these cases is to maintain a simple template to "import" minimal information. I can't stress enough that we want to avoid this, as such information will inevitably get out of date, but here is a pattern you can use to handle it in an ok way:
+
+Add this to your `src/ontology/ont-odk.yaml`:
+
+```
+import_group:
+  products: 
+    - id: my_ncbitaxon
+```
+
+Then add this to `src/ontology/ont.Makefile`:
+
+```
+mirror/my_ncbitaxon.owl:
+	echo "No mirror for $@"
+
+imports/my_ncbitaxon_import.owl: imports/my_ncbitaxon_import.tsv
+	if [ $(IMP) = true ]; then $(ROBOT) template --template $< \
+  --ontology-iri "$(ONTBASE)/$@" --output $@.tmp.owl && mv $@.tmp.owl $@; fi
+
+.PRECIOUS: imports/my_ncbitaxon_import.owl
+```
+
+Now you can manage your import manually in the template, and the ODK will know not to include your manually curated import in your base release. But again, avoid this pattern for anything but the most trivial case (e.g. you need 1 term from a huge ontology).
+
+
+## File is too large: Network timeouts and long runtimes
+
+Remember that ontologies are text files. While this makes them easy to read in your browser, it also makes them huge: from 500 MB (Chebi) to 2 GB (NCBITaxon).
+
+
+Thankfully, ROBOT can automatically read gzipped ontologies without the need of unpacking. To avoid long runtimes and network timeouts, we can do the following two things (with the new ODK 1.2.26):
+
+```
+import_group:
+  products: 
+    - id: pr
+      use_gzipped: TRUE
+```
+This will try to append `.gz` to the default download location (http://purl.obolibrary.org/obo/pr.owl ---> http://purl.obolibrary.org/obo/pr.owl.gz). Note that you must make sure that this file actually exists. It does, for Chebi and Protein Ontology, but not for many others.
+
+
+If the file exists, but is located elsewhere, you can do this:
+
+```
+import_group:
+  products: 
+    - id: pr
+      mirror_from: http://purl.obolibrary.org/obo/pr.owl.gz
+```
+
+You can put any URL in `mirror_from` (including non-obo ones!).
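
The URL conventions described in the docs above (the default OBO PURL, `.gz` appended for `use_gzipped`, and an explicit `mirror_from`) can be sketched in Python as follows. This is only an illustration, not ODK code: the helper names `mirror_url` and `url_exists` are hypothetical, and the precedence of `mirror_from` over `use_gzipped` is an assumption. The existence check uses a plain HTTP HEAD request from the standard library.

```python
import urllib.error
import urllib.request
from typing import Optional

OBO_PURL_BASE = "http://purl.obolibrary.org/obo"


def mirror_url(ont_id: str, mirror_from: Optional[str] = None,
               use_gzipped: bool = False) -> str:
    """Return the URL an import would be mirrored from (hypothetical helper)."""
    if mirror_from:
        # Assumption: an explicit mirror_from takes precedence over everything else.
        return mirror_from
    url = f"{OBO_PURL_BASE}/{ont_id}.owl"
    # use_gzipped appends `.gz` to the default download location.
    return url + ".gz" if use_gzipped else url


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Check with a HEAD request whether the (gzipped) file is actually there."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except OSError:  # URLError, HTTPError and timeouts are all OSError subclasses
        return False
```

Running `url_exists(mirror_url("pr", use_gzipped=True))` before enabling the flag is one way to do the "make sure that this file actually exists" check by hand.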
+
diff --git a/odk/odk.py b/odk/odk.py
@@ -104,6 +104,15 @@ class ImportProduct(Product):
 
     mirror_from: Optional[Url] = None
     """if specified this URL is used rather than the default OBO PURL for the main OWL product"""
+
+    is_large: bool = False
+    """if large, ODK may take measures to reduce the memory footprint of the import."""
+
+    use_base: bool = False
+    """if use_base is true, try to use the base IRI instead of the normal one to mirror from."""
+
+    use_gzipped: bool = False
+    """if use_gzipped is true, try to mirror from the gzipped location (the default URL with `.gz` appended) instead of the normal one."""
 
 @dataclass_json
 @dataclass
@@ -228,7 +237,7 @@ class ImportGroup(ProductGroup):
     """all import products"""
 
     module_type : str = "slme"
-    """Module type. Supported: slme, mireot, minimal, custom"""
+    """Module type. Supported: slme, minimal, custom"""
 
     module_type_slme : str = "BOT"
     """SLME module type. Supported: BOT, TOP, STAR"""
@@ -419,6 +428,12 @@ class OntologyProject(JsonSchemaMixin):
 
     use_dosdps : bool = False
     """if true use dead simple owl design patterns"""
+
+    custom_makefile_header : str = """
+# ----------------------------------------
+# More information: https://github.com/INCATools/ontology-development-kit/
+"""
+    """A multiline string that is added to the Makefile"""
 
     public_release : str = "none"
     """if true add functions to run automated releases (experimental). Current options are: github_curl, github_python."""

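The effect of the new `is_large` field can be pictured as a switch between the two import recipes shown in `docs/DealWithLargeOntologies.md`. The sketch below builds the ROBOT argument list for either case. The function name `import_recipe` and the `ont_base` parameter are hypothetical, the `$(ANNOTATE_ONTOLOGY_VERSION)` flags are omitted, and the real selection happens in the ODK Makefile template, which is not part of this excerpt.

```python
from typing import List


def import_recipe(ont_id: str, ont_base: str, is_large: bool = False) -> List[str]:
    """Return ROBOT arguments for building imports/<ont_id>_import.owl (illustrative)."""
    mirror = f"mirror/{ont_id}.owl"
    terms = f"imports/{ont_id}_terms_combined.txt"
    target = f"imports/{ont_id}_import.owl"
    if is_large:
        # Reduced recipe: extract straight from the mirror, no SPARQL steps.
        args = ["extract", "-i", mirror, "-T", terms,
                "--force", "true", "--method", "BOT"]
    else:
        # Default recipe: SPARQL preprocessing, extraction, SPARQL postprocessing.
        args = ["query", "-i", mirror,
                "--update", "../sparql/preprocess-module.ru",
                "extract", "-T", terms, "--force", "true",
                "--copy-ontology-annotations", "true",
                "--individuals", "exclude", "--method", "BOT",
                "query", "--update", "../sparql/inject-subset-declaration.ru",
                "--update", "../sparql/postprocess-module.ru"]
    # Both recipes end with the same annotate step writing to a temporary file.
    args += ["annotate", "--ontology-iri", f"{ont_base}/{target}",
             "--output", f"{target}.tmp.owl"]
    return args
```

The reduced variant drops both `query` steps, which are the ones that force the extra loads through Jena described in the docs.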
diff --git a/schema/project-schema.json b/schema/project-schema.json
@@ -326,6 +326,10 @@
             },
             "type": "array"
         },
+        "custom_makefile_header": {
+            "default": "\n# ----------------------------------------\n# More information: https://github.com/INCATools/ontology-development-kit/\n",
+            "type": "string"
+        },
         "description": {
             "default": "None",
             "type": "string"

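As a rough illustration of how the `custom_makefile_header` default in the schema above behaves, the following sketch falls back to the default string when a project does not set the field. The helper names are hypothetical. The second function is an optional sanity check that is not part of ODK: since the header is injected into the Makefile, any line that is not a comment would be interpreted as make syntax, so it may be worth listing such lines before using a custom header.

```python
from typing import Dict, List

# Default copied from schema/project-schema.json in this pull request.
DEFAULT_HEADER = (
    "\n# ----------------------------------------\n"
    "# More information: https://github.com/INCATools/ontology-development-kit/\n"
)


def makefile_header(project: Dict[str, str]) -> str:
    """Return the configured header, or the schema default if none is set."""
    return project.get("custom_makefile_header", DEFAULT_HEADER)


def non_comment_lines(header: str) -> List[str]:
    """List header lines that are neither blank nor Makefile comments."""
    return [line for line in header.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
```

For the OMRSE example in the changelog, every line starts with `#`, so `non_comment_lines` would return an empty list.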
diff --git a/template/README.md.jinja2 b/template/README.md.jinja2
@@ -1,5 +1,8 @@
+{%- if project.ci is defined -%}{% if 'travis' in project.ci %}
 [![Build Status](https://travis-ci.org/{{ project.github_org }}/{{ project.repo }}.svg?branch=master)](https://travis-ci.org/{{ project.github_org }}/{{ project.repo }})
-[![DOI](https://zenodo.org/badge/13996/{{ project.github_org }}/{{ project.repo }}.svg)](https://zenodo.org/badge/latestdoi/13996/{{ project.github_org }}/{{ project.repo }})
+{%- endif -%}{% if 'github_actions' in project.ci %}
+![Build Status](https://github.com/{{ project.github_org }}/{{ project.repo }}/workflows/CI/badge.svg)
+{% endif %}{% endif -%}
 
 # {{ project.title }}
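
The conditional banner logic in the template hunk above can be restated in plain Python to make the branches easier to follow. This is a sketch, not the template itself: `readme_badges` is a hypothetical name, and `ci=None` stands in for the case where `project.ci` is not defined.

```python
from typing import Iterable, List, Optional


def readme_badges(github_org: str, repo: str,
                  ci: Optional[Iterable[str]]) -> List[str]:
    """Return the badge lines the README template would emit for a project."""
    badges: List[str] = []
    if ci is None:  # corresponds to `project.ci is defined` being false
        return badges
    systems = set(ci)
    slug = f"{github_org}/{repo}"
    if "travis" in systems:
        badges.append(
            f"[![Build Status](https://travis-ci.org/{slug}.svg?branch=master)]"
            f"(https://travis-ci.org/{slug})"
        )
    if "github_actions" in systems:
        badges.append(
            f"![Build Status](https://github.com/{slug}/workflows/CI/badge.svg)"
        )
    return badges
```

Note that the Zenodo DOI badge is removed by this change and therefore has no branch here.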