# A `>_` bash kernel Jupyter Notebook for testing Ontologizer on the output of HBA-DEALS

---
**NOTE**

To work through this notebook, Ontologizer and its required dependencies must be installed. To check whether Ontologizer is available on your system, first run the following:


```bash
java -jar Ontologizer.jar --help
```

If Ontologizer is not available in your Jupyter Notebook session, please follow the instructions and execute the install script:

```bash
bash install_ontologizer.sh
```

---
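The check described in the note can also be scripted. The following is a minimal sketch, assuming `Ontologizer.jar` sits in the current working directory; the helper name `ontologizer_available` is hypothetical and not part of Ontologizer or the install script.

```bash
# Hypothetical helper: succeeds only if the jar file exists and a `java`
# executable can be found on PATH. The default jar path is an assumption.
ontologizer_available() {
    jar="${1:-Ontologizer.jar}"
    [ -f "$jar" ] && command -v java >/dev/null 2>&1
}

if ontologizer_available Ontologizer.jar; then
    echo "Ontologizer found"
else
    echo "Ontologizer not found; run: bash install_ontologizer.sh"
fi
```

The sketch only reports the state and leaves running `install_ontologizer.sh` to you, so that nothing is installed without an explicit step.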

In [4]:
java -jar Ontologizer.jar --help

usage: java -jar Ontologizer.jar [-a <file>] [-c <arg>] [-d <[thrsh[,id]|id]>] [-f <arg>] [-g
       <file>] [-h] [-i] [-m <arg>] [-n] [-o <arg>] [-p <file>] [-r <arg>] [-s <path>] [-t <arg>]
       [-v]
Analyze High-Throughput Biological Data Using Gene Ontology
 -a,--association <file>      File containing associations from genes to GO terms. Required
 -c,--calculation <arg>       Specifies the calculation method to use. Possible values are: "MGSA",
                              "Parent-Child-Intersection", "Parent-Child-Union" (default),
                              "Term-For-Term", "Topology-Elim", "Topology-Weighted"
 -d,--dot <[thrsh[,id]|id]>   For every study set analysis write out an additional .dot file
                              (GraphViz) containing the graph that is induced by interesting nodes.
                              The optional argument thrsh must be in range between 0 and 1 and it
                              specifies the threshold used to identify interes

# Retrieve required data for `Ontologizer`

The `ontologizer_test.tar.gz` archive contains 4 files:

| | FILE | DESCRIPTION|
|--|:---|:---|
|1|**`universe.txt`** | created by writing all the `GeneSymbol` entries in the HBA-DEALS results table. To retrieve `GeneSymbol`, the `Gene` column was split on `'_'` into `Geneid` and `GeneSymbol`, e.g. `ENSG00000004059.11_ARF5` -> `ENSG00000004059.11`, `ARF5` |
|2|**`gene_set.txt`** | created by writing the `GeneSymbol` entries that pass a filtering criterion (for this test, I used `P` < 0.05 and `ExpLogFc` > 1.2) |
|3|**`goa_human.gaf`** | downloaded from here: http://current.geneontology.org/annotations/goa_human.gaf.gz |
|4|**`go.obo`** | downloaded from here: http://purl.obolibrary.org/obo/go.obo |

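The way `universe.txt` and `gene_set.txt` are described in the table can be sketched with `awk`. This is an illustration under stated assumptions, not the exact procedure used to build the archive: it assumes the HBA-DEALS results are a tab-separated file with a header row containing the columns `Gene`, `P` and `ExpLogFc`, and both the function name `make_gene_lists` and the input file name are hypothetical.

```bash
# Hypothetical helper: derive universe.txt and gene_set.txt from a
# tab-separated HBA-DEALS results table (column names are assumptions).
make_gene_lists() {
    awk -F'\t' '
        NR == 1 {
            # Locate the columns by header name instead of by position
            for (i = 1; i <= NF; i++) col[$i] = i
            g = col["Gene"]; p = col["P"]; fc = col["ExpLogFc"]
            next
        }
        {
            symbol = $g
            # Drop the Geneid prefix: ENSG00000004059.11_ARF5 -> ARF5
            sub(/^[^_]*_/, "", symbol)
            # Every gene symbol goes into the universe, once
            if (!(symbol in in_universe)) {
                in_universe[symbol] = 1
                print symbol > "universe.txt"
            }
            # Only rows passing the filter go into the study set, once
            if ($p + 0 < 0.05 && $fc + 0 > 1.2 && !(symbol in in_study)) {
                in_study[symbol] = 1
                print symbol > "gene_set.txt"
            }
        }
    ' "$1"
}

# Usage (file name is a placeholder):
# make_gene_lists hbadeals_results.tsv
```

Note that the function overwrites any `universe.txt` and `gene_set.txt` already present in the working directory, and that `gene_set.txt` is only created if at least one row passes the filter.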


The release with the example data in `ontologizer_test.tar.gz` can be found at:<br>
https://github.com/cgpu/HBA-DEALS/releases/tag/ontologizer.

To get the URL of the `ontologizer_test.tar.gz` file, right-click it and select `Copy link address`, as shown in the gif:

![](http://g.recordit.co/5IcThtkQ6H.gif)



In [9]:
# download the archive and decompress it into a folder named ontologizer_test
# move the contents of ontologizer_test into the current working dir and delete the emptied folder
wget https://github.com/cgpu/HBA-DEALS/releases/download/ontologizer/ontologizer_test.tar.gz && \
tar -xvzf ontologizer_test.tar.gz -C . && \
mv ontologizer_test/* . && \
rm -r ontologizer_test

--2020-04-28 11:51:51--  https://github.com/cgpu/HBA-DEALS/releases/download/ontologizer/ontologizer_test.tar.gz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/255131379/8417ed80-8949-11ea-98ab-77b53f85156d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200428%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200428T115022Z&X-Amz-Expires=300&X-Amz-Signature=6f47b6aa8cc5dfd57cc5cd48c0846ec8abc7ff8d6efef8316e9eead7c1fcada6&X-Amz-SignedHeaders=host&actor_id=0&repo_id=255131379&response-content-disposition=attachment%3B%20filename%3Dontologizer_test.tar.gz&response-content-type=application%2Foctet-stream [following]
--2020-04-28 11:51:51--  https://github-production-release-asset-2e65be.s3.amazonaws.com/255131379/8417ed80-8949-11ea-98ab-77b53f85156d?X-Amz-Algorithm=AWS4-HMA

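After extraction, the four files listed in the table above should be in the current working directory. A small sanity check along these lines can catch a failed download before Ontologizer is run; the function name `check_inputs` is hypothetical.

```bash
# Hypothetical helper: report which of the four expected input files are
# present and non-empty; the return status is the number of missing files.
check_inputs() {
    missing=0
    for f in universe.txt gene_set.txt goa_human.gaf go.obo; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

check_inputs || echo "Some inputs are missing; re-run the download cell."
```
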
# Run `Ontologizer` 
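Based on [`@karleg's Breast-96-Samples.R (ijc+sjc)`](https://github.com/TheJacksonLaboratory/sbas/commit/c5b1ffcebbbde03057cf85e31e9ae4743103df08#diff-9f1b2a73fd1da7a66fe6b2d603f2f55bR156). The call below passes `go.obo` as the ontology (`-g`), `goa_human.gaf` as the gene-to-GO-term associations (`-a`), `gene_set.txt` as the study set (`-s`) and `universe.txt` as the population (`-p`), and selects the `Term-For-Term` calculation (`-c`) with `Benjamini-Hochberg` multiple-testing correction (`-m`).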

In [11]:
java -jar Ontologizer.jar \
    -g go.obo \
    -a goa_human.gaf \
    -s gene_set.txt \
    -p universe.txt \
    -c Term-For-Term \
    -m Benjamini-Hochberg \
    -n

Parse obo file "go.obo"
Apr 28, 2020 11:54:21 AM ontologizer.ontology.OBOParser doParse
INFO: Got 47439 terms and 94145 relations in 533 ms
Details of parsed obo file:
  date:			null
  format:		1.2
  term definitions:	47439
Building graph
Apr 28, 2020 11:54:21 AM ontologizer.ontology.Ontology assignLevel1TermsAndFixRoot
INFO: Ontology contains multiple level-one terms: "molecular_function" ,"cellular_component" ,"biological_process". Adding artificial root term "GO:0000000".
Apr 28, 2020 11:54:21 AM ontologizer.set.StudySetFactory createFromFile
INFO: Processing studyset gene_set.txt
Apr 28, 2020 11:54:22 AM ontologizer.set.StudySetFactory createFromFile
INFO: Processing studyset universe.txt
Apr 28, 2020 11:54:22 AM ontologizer.association.GAFByteLineScanner newLine
Apr 28, 2020 11:54:22 AM ontologizer.association.GAFByteLineScanner newLine
Apr 28, 2020 11:54:23 AM ontologizer.association.GAFByteLineScanner newLine
Apr 28, 2020 11:54:23 AM ontologizer.association.GAFByteLineScanner ne

In [17]:
head -1 anno-gene_set-Term-For-Term-Benjamini-Hochberg.txt 

ZNF782		annotations={GO:0003677,GO:0005634,GO:0006355,GO:0046872} ancestors_annotations={GO:0000000,GO:0005622,GO:0050789,GO:0009889,GO:0031326,GO:0043231,GO:0006807,GO:0043229,GO:0005488,GO:0043226,GO:0050794,GO:0043227,GO:0031323,GO:0032774,GO:1901576,GO:0009058,GO:0009059,GO:0008150,GO:0090304,GO:0008152,GO:0051171,GO:2001141,GO:0034641,GO:0003674,GO:0003676,GO:0034645,GO:0046483,GO:0006139,GO:0051252,GO:0034654,GO:0019219,GO:0005575,GO:0060255,GO:0080090,GO:0019222,GO:0018130,GO:0010468,GO:0071704,GO:0010467,GO:0009987,GO:0097659,GO:2000112,GO:0019438,GO:0065007,GO:0044249,GO:1901362,GO:1901363,GO:1901360,GO:0043167,GO:0044238,GO:0044237,GO:0006725,GO:0016070,GO:0097159,GO:0043169,GO:0044271,GO:0043170,GO:1903506,GO:0006351,GO:0110165,GO:0044260,GO:0010556}


In [19]:
head -4 table-gene_set-Term-For-Term-Benjamini-Hochberg.txt

ID	Pop.total	Pop.term	Study.total	Study.term	p	p.adjusted	p.min	name
GO:0009987	7922	6714	5219	4513	2.9524997368388813E-9	2.201587132545262E-5	0.0	"cellular process"
GO:0008150	7922	6940	5219	4654	3.79613077203844E-9	2.201587132545262E-5	0.0	"biological_process"
GO:0003723	7922	1185	5219	866	5.1584023764740816E-9	2.201587132545262E-5	0.0	"RNA binding"

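The result table is tab-separated with a header row, so it can be filtered with standard tools. As a minimal sketch, the function below keeps the header and every term whose `p.adjusted` value is below a threshold; the function name `significant_terms` and the default threshold of 0.05 are assumptions made for illustration, not part of Ontologizer.

```bash
# Hypothetical helper: print the header plus all rows whose p.adjusted
# value is below alpha (second argument, defaulting to 0.05).
significant_terms() {
    awk -F'\t' -v alpha="${2:-0.05}" '
        NR == 1 {
            # Find the p.adjusted column by name
            for (i = 1; i <= NF; i++) if ($i == "p.adjusted") c = i
            print
            next
        }
        # awk converts scientific notation such as 2.2E-5 numerically
        c && $c + 0 < alpha + 0
    ' "$1"
}

# Usage:
# significant_terms table-gene_set-Term-For-Term-Benjamini-Hochberg.txt 0.01
```

Looking the column up by name rather than by position keeps the sketch working if the column order differs between Ontologizer versions.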