# GOATOOLS: A Python library for Gene Ontology analyses

D. V. Klopfenstein, Liangsheng Zhang, Brent S. Pedersen, Fidel Ramírez, Alex Warwick Vesztrocy, Aurélien Naldi, Christopher J. Mungall, Jeffrey M. Yunes, Olga Botvinnik, Mark Weigel, Will Dampier, Christophe Dessimoz, Patrick Flick, Haibao Tang

# International Group of Contributors

* School of Biomedical Engineering, Science, and Health Systems, Drexel University, **Philadelphia, PA, USA**
* Center for Genomics and Biotechnology, Fujian Agriculture and Forestry University, **Fuzhou, China**
* Department of Human Genetics, University of Utah, **Salt Lake City, UT, USA**
* Max Planck Institute of Immunobiology and Epigenetics, **Freiburg, Germany**
* Department of Computational Biology, University of Lausanne, **Lausanne, Switzerland**
* Center for Integrative Genomics, University of Lausanne, **Lausanne, Switzerland**
* Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, **Berkeley, CA, USA**
* UC Berkeley - UCSF Graduate Program in Bioengineering, University of California, **San Francisco, CA, USA**
* Bioinformatics and Systems Biology Program, University of California, **San Diego, CA, USA**
* Independent Researcher, **Philadelphia, PA, USA**
* Department of Microbiology and Immunology, Drexel University College of Medicine, **Philadelphia, PA, USA**
* School of Computational Science and Engineering, Georgia Institute of Technology, **Atlanta, GA, USA**

# From Universities all over the World

* School of Biomedical Engineering, Science, and Health Systems, **Drexel University**, Philadelphia, PA, USA
* Center for Genomics and Biotechnology, **Fujian Agriculture and Forestry University**, Fuzhou, China
* Department of Human Genetics, **University of Utah**, Salt Lake City, UT, USA
* **Max Planck Institute of Immunobiology and Epigenetics**, Freiburg, Germany
* Department of Computational Biology, **University of Lausanne**, Lausanne, Switzerland
* Center for Integrative Genomics, **University of Lausanne**, Lausanne, Switzerland
* Division of Environmental Genomics and Systems Biology, **Lawrence Berkeley National Laboratory**, Berkeley, CA, USA
* **UC Berkeley** - UCSF Graduate Program in Bioengineering, University of California, San Francisco, CA, USA
* Bioinformatics and Systems Biology Program, **University of California, San Diego**, CA, USA
* Department of Microbiology and Immunology, **Drexel University College of Medicine**, Philadelphia, PA, USA
* School of Computational Science and Engineering, **Georgia Institute of Technology**, Atlanta, GA, USA

## Why
Gene Ontology (GO) is used to describe gene products in a computationally accessible manner.

The GOATOOLS Python library and scripts can be used to: 

* Feature set for genes and gene products
* Query the GO and its annotations to gene products
* Run enrichment analyses on sets of genes
* Group GO terms using the researcher's knowledge

# What is Gene Ontology?

A set of terms that describe gene product functions.

# GO Term: germinal center formation

fieldname       | value
----------------|---------------------
id              | GO:0002467
name            | germinal center formation
def             | The process in which germinal centers form. A germinal center is a specialized microenvironment formed when activated B cells enter lymphoid follicles. Germinal centers are the foci for B cell proliferation and somatic hypermutation. 
GO_REF:0000022  |https://github.com/geneontology/go-site/blob/master/metadata/gorefs/goref-0000022.md
ISBN:081533642X |![book](images/BookImmunology50.png)

# GO Terms are related to one another
![GO relationships](images/fig2.png)

# Three Branches

GO Term    | NS |  dcnt |level|depth| branch
-----------|----|-------|-----|-----|--------------------
GO:0008150 | BP | 29698 | L00 | D00 | biological_process
GO:0003674 | MF | 11147 | L00 | D00 | molecular_function
GO:0005575 | CC |  4201 | L00 | D00 | cellular_component


# Layout of Biological Process Branch
![BP layout](images/fig1.png)

# Depth-01 GO Terms for Biological Process

A | dcnt |dep|GO Term   |name
--|------|---|----------|---------------------
  |29,698|D00|GO:0008150|biological_process
A |12,811|D01|GO:0065007|biological regulation
B |11,254|D01|GO:0009987|cellular process
C | 6,399|D01|GO:0008152|metabolic process
D | 3,219|D01|GO:0032502|developmental process
E | 2,283|D01|GO:0050896|response to stimulus
F | 2,120|D01|GO:0051179|localization
G | 1,740|D01|GO:0071840|cellular component organization or biogenesis
H | 1,474|D01|GO:0051704|multi-organism process
I | 1,000|D01|GO:0032501|multicellular organismal process
J |   843|D01|GO:0022414|reproductive process
K |   572|D01|GO:0002376|immune system process
L |   402|D01|GO:0040011|locomotion
M |   218|D01|GO:0007610|behavior
N |   160|D01|GO:0008283|cell proliferation
O |   157|D01|GO:0040007|growth
P |   128|D01|GO:0022610|biological adhesion
Q |   125|D01|GO:0023052|signaling
R |    62|D01|GO:0044848|biological phase
S |    50|D01|GO:0048511|rhythmic process
T |    48|D01|GO:0098754|detoxification
U |    32|D01|GO:0000003|reproduction
V |    31|D01|GO:0001906|cell killing
W |    17|D01|GO:0043473|pigmentation
X |    13|D01|GO:0098743|cell aggregation

# Depth Varies Across All Three Branches

Depth or level | Depth BP | Depth MF | Depth CC | Level BP | Level MF | Level CC
---|----|-----|-----|-----|-----|-----
00 |   1|    1|    1|    1|    1|    1
01 |  29|   15|   21|   29|   15|   21
02 | 264|  126|  347|  420|  146|  746
03 |1272|  571|  492| 2213|  869| 1070
04 |2375| 1538|  735| 4856| 2098| 1360
05 |3702| 4815|  910| 7295| 5051|  696
06 |4477| 1830|  786| 7276| 1929|  229
07 |4695|  969|  598| 4694|  727|   68
08 |4209|  573|  256| 1999|  201|   10
09 |3513|  311|   51|  634|   79|    1
10 |2402|  153|    4|  244|   13|    0
11 |1516|  140|    1|   38|   19|    0
12 | 853|   42|    0|    0|    0|    0
13 | 307|   35|    0|    0|    0|    0
14 |  66|   21|    0|    0|    0|    0
15 |  14|    7|    0|    0|    0|    0
16 |   4|    1|    0|    0|    0|    0
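
The difference between the level and depth columns can be illustrated on a toy graph. This is a minimal sketch, assuming that level means the shortest path from the root, depth means the longest path from the root, and `dcnt` means the number of descendants. The term names and the `PARENTS` mapping are hypothetical and are not taken from the GO.

```python
from functools import lru_cache

# Hypothetical toy DAG: term -> set of direct parents (is_a edges only).
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"c", "b"},  # reachable by a short path (via b) and a long path (via a, c)
}

@lru_cache(maxsize=None)
def level(term):
    """Length of the shortest path from the root to `term`."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + min(level(p) for p in parents)

@lru_cache(maxsize=None)
def depth(term):
    """Length of the longest path from the root to `term`."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(depth(p) for p in parents)

def descendants(term):
    """All terms that have `term` as an ancestor."""
    children = {c for c, ps in PARENTS.items() if term in ps}
    found = set(children)
    for child in children:
        found |= descendants(child)
    return found

for term in PARENTS:
    print(f"{term:4} L{level(term):02d} D{depth(term):02d} dcnt={len(descendants(term))}")
```

A term with several parents, such as `d` here, can have a level smaller than its depth, which is why the two sets of counts in the table differ.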



# Grouping

# List of 12 GO Terms
![plain GO list](images/az_ungrouped_bw.png)

# Some GO terms are related
![Unsorted Grouped GO Terms](images/az_ungrouped_rgb.png)

# Group related GO terms
![Group/Sorted GO terms](images/az_grouped_rgb.png)

# Flexible Grouping: Research can guide choices
![Two grouping choices](./images/fig2.png)

# Group using a Sections File

# Generate a new Sections File

**Running this**:
```
$ scripts/wr_sections.py goids.txt
hdr GOs(0 in 0 sections,  61 unused) WROTE: sections_in.txt
hdr GOs(0 in 0 sections,  61 unused) WROTE: sections.txt
usr GOs(0 in 0 sections, 840 ungrpd) WROTE: grouped_gos.txt
   840 user GO IDs

```

**Creates sections_in.txt (ASCII text file)**:
```
# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...
```


# Add User-defined Sections

**Original sections file:**
```
# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...
```

**User-defined sections:**
```
# SECTION: immune
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process

# SECTION: development
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...

# SECTION: stimulus
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus

# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
```
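
The sections file layout shown above is simple enough to read with a few lines of Python. The sketch below is not the GOATOOLS reader. `parse_sections` is a hypothetical helper that assumes each section starts with a `# SECTION:` line and that each GO ID is the first token on its line.

```python
import re

SECTIONS_TEXT = """\
# SECTION: immune
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process

# SECTION: development
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...

# SECTION: stimulus
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus

# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
"""

def parse_sections(text):
    """Return {section name: [GO IDs]} in file order (hypothetical helper)."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"#\s*SECTION:\s*(.+)", line)
        if header:
            # Start a new section; later GO IDs are appended to it.
            current = header.group(1).strip()
            sections[current] = []
            continue
        goid = re.match(r"(GO:\d{7})", line)
        if goid and current is not None:
            sections[current].append(goid.group(1))
    return sections

sections = parse_sections(SECTIONS_TEXT)
for name, goids in sections.items():
    print(name, goids)
```

Because Python dictionaries preserve insertion order, the sections come back in the order in which they appear in the file.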

![Two grouping choices](./images/fig2.png)

# Move _germinal center formation_
Move _germinal center formation_ from the **development** section to the **immune** section:
```
# SECTION: immune
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process
GO:0006955  # immune response

# SECTION: development
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...

# SECTION: stimulus
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus

# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
```

# Compare GOATOOLS GOEAs to Other Tools

* GO Terms returned: Specific vs Broad
* Genes Associated with GO Terms
* P-values across tools
* Genes returned across tools
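
As background for the P-value comparison, the calculation behind a single enrichment test can be sketched with the standard library alone. This is a minimal sketch, assuming a one-sided hypergeometric tail (the over-representation side of Fisher's exact test). It does not reproduce the exact test or the multiple-testing corrections used by any of the tools compared here, and the gene counts are hypothetical.

```python
from math import comb

def enrichment_pvalue(study_hits, study_size, pop_hits, pop_size):
    """P(X >= study_hits) for X ~ Hypergeometric(pop_size, pop_hits, study_size)."""
    upper = min(study_size, pop_hits)
    total = comb(pop_size, study_size)
    # Sum the probabilities of seeing study_hits or more annotated genes.
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 8 of 50 study genes carry the term,
# versus 100 of 10,000 genes in the population.
p = enrichment_pvalue(study_hits=8, study_size=50, pop_hits=100, pop_size=10_000)
print(f"p = {p:.3g}")
```

In a real analysis this test is repeated for every GO term, so the raw P-values need a multiple-testing correction before they are interpreted.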


# GO Terms enriched: Specific vs Broad
![specific_broad](images/cmp1_specific_vs_broad.png)

# P-values
![pvals](images/cmp2_pvals.png)

# Genes Associated with GO Terms
![assc](images/cmp3_assc.png)

# Genes returned
![genes returned](images/suppfig4.png)

# GOATOOLS found more statistically significant GO terms than DAVID 6.8 when using the same annotations
![GOATOOLS and DAVID](images/goatools_v_david6p8.png)

# Stochastic GOEA Simulations with GOATOOLS

# First Simulations Failed
![fail](images/suppfig1.png)

# Remove 30 Broad GOs from 17,000 total
![sim_p0](images/sim_p0.png)

# Turn Propagate Counts On
![sim_p1](images/sim_p1.png)
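
Propagating counts means that a gene annotated to a GO term is also counted for every ancestor of that term. GOATOOLS exposes this as the `propagate_counts` option. The sketch below only illustrates the idea on a hypothetical toy hierarchy with made-up gene names and is not the library's implementation.

```python
# Hypothetical toy hierarchy: term -> set of direct parents.
PARENTS = {
    "process": set(),
    "immune": {"process"},
    "development": {"process"},
    "germinal_center": {"immune", "development"},
}

def ancestors(term):
    """All terms reachable from `term` by following parent edges."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found

def propagate(gene2terms):
    """Add every ancestor of each annotated term to the gene's term set."""
    return {
        gene: set(terms).union(*(ancestors(t) for t in terms))
        for gene, terms in gene2terms.items()
    }

def term_counts(gene2terms):
    """Number of genes annotated to each term."""
    counts = {}
    for terms in gene2terms.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return counts

direct = {"geneA": {"germinal_center"}, "geneB": {"immune"}}
print(term_counts(direct))
print(term_counts(propagate(direct)))
```

After propagation, broad terms near the root accumulate counts from all of their descendants.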

## Conclusion

The authors were able to show that their architecture has some useful properties:
  - Basic properties of DNA can be captured without explicit descriptions.
    - $-z_i$ is the complement of $z_i$.
    - DNA motifs and splice junctions can be replicated.
  - The model exhibits useful mathematical properties.
    - Interpolations in latent-space correspond to smooth transitions in sequence space.
    - Generators can be hooked head-to-tail with any predictive model to generate simple tools for optimizing literally anything. The predictive model does not have to be a neural network. It can be anything from a pattern matching algorithm to a webservice.

### Things I wish they did

- Explore how the model architecture impacts the mapping. Why 5 layers of residual blocks? Why not 2, why not 20? How would you even measure that? I would've liked a little more discussion around those topics.
- In newer GAN literature (possibly published *after* this paper), people have started adding a third part to the overall model: a *reverse-generator* that maps a sequence to its corresponding position in the latent space. This lets you find where in Z-space a particular sequence of interest lies, and then find the optimal sequence **near your own**.

### Things I think we can use this for

- Generate gRNAs! We can collect a large number of gRNAs from across HIV and use those as a library to train a generator. Then use CRSeek and DeepCRISPR to evaluate how effective a gRNA will be at cleaving a sequence and how well it will work across a population.
- Generate seemingly realistic HIV sequences. Can we extend them to full genes? Can we penalize those that include stop codons (defunct viruses)? Can we include a subtype category to condition the output?
- Generate broadly neutralizing antibodies. Although some Googling hasn't really turned up a good database of antibodies ... I can't even figure out how one even describes the sequence an antibody recognizes. I think this is a really fruitful area. Know anyone who needs a project?
- I think there's a cool side project in here:
  - Train the model on Tat sequences, incorporating the *reverse-generator*.
  - Hook the GO predictor model we have to the end of the generator ... probably limit the output to a single neuron, like the closest related P-TEFb binding.
  - Use optimization to find a collection of sequences to test.


