# GOATOOLS: A Python library for Gene Ontology analyses

D. V. Klopfenstein, Liangsheng Zhang, Brent S. Pedersen, Fidel Ramírez, Alex Warwick Vesztrocy, Aurélien Naldi, Christopher J. Mungall, Jeffrey M. Yunes, Olga Botvinnik, Mark Weigel, Will Dampier, Christophe Dessimoz, Patrick Flick, Haibao Tang

# International Group of Contributors

* School of Biomedical Engineering, Science, and Health Systems, Drexel University, **Philadelphia, PA, USA**
* Center for Genomics and Biotechnology, Fujian Agriculture and Forestry University, **Fuzhou, China**
* Department of Human Genetics, University of **Utah, Salt Lake City, UT, USA**
* Max Planck Institute of Immunobiology and Epigenetics, **Freiburg, Germany**
* Department of Computational Biology, University of Lausanne, **Lausanne, Switzerland**
* Center for Integrative Genomics, University of Lausanne, **Lausanne, Switzerland**
* Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, **Berkeley, CA, USA**
* UC Berkeley - UCSF Graduate Program in Bioengineering, University of California, **San Francisco, CA, USA**
* Bioinformatics and Systems Biology Program, University of California, **San Diego, CA, USA**
* Independent Researcher, **Philadelphia, PA, USA**
* Department of Microbiology and immunology, Drexel University College of Medicine, **Philadelphia, PA, USA**
* School of Computational Science and Engineering, Georgia Institute of Technology, **Atlanta, GA, USA**
<body><p style = "font-family:georgia,garamond,serif;font-size:xxsmall;font-style:italic;"><center>Copyright (C) 2015-2019, DV Klopfenstein. All rights reserved.</center></p></body>

# From Universities all over the World

* School of Biomedical Engineering, Science, and Health Systems, **Drexel University**, Philadelphia, PA, USA
* Center for Genomics and Biotechnology, **Fujian Agriculture and Forestry University**, Fuzhou, China
* Department of Human Genetics, **University of Utah**, Salt Lake City, UT, USA
* **Max Planck Institute of Immunobiology and Epigenetics**, Freiburg, Germany
* Department of Computational Biology, **University of Lausanne**, Lausanne, Switzerland
* Center for Integrative Genomics, **University of Lausanne**, Lausanne, Switzerland
* Division of Environmental Genomics and Systems Biology, **Lawrence Berkeley National Laboratory**, Berkeley, CA, USA
* **UC Berkeley** - UCSF Graduate Program in Bioengineering, University of California, San Francisco, CA, USA
* Bioinformatics and Systems Biology Program, **University of California, San Diego**, CA, USA
* Department of Microbiology and immunology, **Drexel University College of Medicine**, Philadelphia, PA, USA
* School of Computational Science and Engineering, **Georgia Institute of Technology**, Atlanta, GA, USA

## Why
Gene Ontology (GO) is used to describe gene products in a computationally accessible manner.

The GOATOOLS Python library and scripts can be used to: 

* Provide a feature set for genes and gene products
* Query the GO and its annotations to gene products
* Run enrichment analyses on sets of genes
* Group GO terms using the researcher's knowledge

# What is Gene Ontology?

A set of terms that describe gene product functions.

# GO Term: germinal center formation

fieldname       | value
----------------|---------------------
id              | GO:0002467
name            | germinal center formation
def             | The process in which germinal centers form. A germinal center is a specialized microenvironment formed when activated B cells enter lymphoid follicles. Germinal centers are the foci for B cell proliferation and somatic hypermutation. 
GO_REF:0000022  |https://github.com/geneontology/go-site/blob/master/metadata/gorefs/goref-0000022.md
ISBN:081533642X |![book](images/BookImmunology50.png)
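
The ontology is distributed as an OBO text file in which each term is a `[Term]` stanza of `tag: value` lines. As a minimal sketch, the standard-library code below parses one such stanza into a dictionary. The stanza is hand-written from the table above (the definition is shortened and all other tags are omitted), and `parse_stanza` is a hypothetical helper, not part of GOATOOLS.

```python
# Hand-written stanza holding only fields shown in the table above.
# The definition text is shortened; relationship tags are omitted.
STANZA = """\
[Term]
id: GO:0002467
name: germinal center formation
def: "The process in which germinal centers form." [GO_REF:0000022, ISBN:081533642X]
"""


def parse_stanza(text):
    """Parse the 'tag: value' lines of one OBO stanza into a dict (hypothetical helper)."""
    record = {}
    for line in text.splitlines():
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        record[tag] = value
    return record


term = parse_stanza(STANZA)
print(term["id"], term["name"])
```

A full OBO file repeats this pattern for every term, so the same loop applied per stanza would recover the `id`, `name`, and `def` fields listed in the table.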

# GO Terms are related to one another
<img src="images/fig2.png" height="700" width="700">
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Three Branches

GO Term    | NS |  dcnt |level|depth| branch
-----------|----|-------|-----|-----|--------------------
GO:0008150 | BP | 29698 | L00 | D00 | biological_process
GO:0003674 | MF | 11147 | L00 | D00 | molecular_function
GO:0005575 | CC |  4201 | L00 | D00 | cellular_component




# Layout of Biological Process Branch
![BP layout](images/fig1.png)
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Depth-01 GO Terms for Biological Process

A | dcnt |dep|GO Term   |name
--|------|---|----------|---------------------
  |29,698|D00|GO:0008150|biological_process
A |12,811|D01|GO:0065007|biological regulation
B |11,254|D01|GO:0009987|cellular process
C | 6,399|D01|GO:0008152|metabolic process
D | 3,219|D01|GO:0032502|developmental process
E | 2,283|D01|GO:0050896|response to stimulus
F | 2,120|D01|GO:0051179|localization
G | 1,740|D01|GO:0071840|cellular component organization or biogenesis
H | 1,474|D01|GO:0051704|multi-organism process
I | 1,000|D01|GO:0032501|multicellular organismal process
J |   843|D01|GO:0022414|reproductive process
K |   572|D01|GO:0002376|immune system process
L |   402|D01|GO:0040011|locomotion
M |   218|D01|GO:0007610|behavior
N |   160|D01|GO:0008283|cell proliferation
O |   157|D01|GO:0040007|growth
P |   128|D01|GO:0022610|biological adhesion
Q |   125|D01|GO:0023052|signaling
R |    62|D01|GO:0044848|biological phase
S |    50|D01|GO:0048511|rhythmic process
T |    48|D01|GO:0098754|detoxification
U |    32|D01|GO:0000003|reproduction
V |    31|D01|GO:0001906|cell killing
W |    17|D01|GO:0043473|pigmentation
X |    13|D01|GO:0098743|cell aggregation

# Depth Varies Across All Three Branches

Depth or Level | BP (depth) | MF (depth) | CC (depth) | BP (level) | MF (level) | CC (level)
---|----|-----|-----|-----|-----|-----
00 |   1|    1|    1|    1|    1|    1
01 |  29|   15|   21|   29|   15|   21
02 | 264|  126|  347|  420|  146|  746
03 |1272|  571|  492| 2213|  869| 1070
04 |2375| 1538|  735| 4856| 2098| 1360
05 |3702| 4815|  910| 7295| 5051|  696
06 |4477| 1830|  786| 7276| 1929|  229
07 |4695|  969|  598| 4694|  727|   68
08 |4209|  573|  256| 1999|  201|   10
09 |3513|  311|   51|  634|   79|    1
10 |2402|  153|    4|  244|   13|    0
11 |1516|  140|    1|   38|   19|    0
12 | 853|   42|    0|    0|    0|    0
13 | 307|   35|    0|    0|    0|    0
14 |  66|   21|    0|    0|    0|    0
15 |  14|    7|    0|    0|    0|    0
16 |   4|    1|    0|    0|    0|    0
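
Depth and level counts differ because a term can reach the root by more than one path: level is the length of the shortest path to the root, depth is the length of the longest path, and dcnt is the number of descendants. The sketch below computes all three on a toy DAG. The term names and edges are hypothetical and the functions are illustrative, not the GOATOOLS implementation.

```python
from functools import lru_cache

# Toy DAG (hypothetical terms): child -> set of direct parents.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "c": {"a"},
    "d": {"c", "b"},  # short path via b, long path via c and a
}


@lru_cache(maxsize=None)
def level(term):
    """Length of the shortest path from the term up to the root."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + min(level(p) for p in parents)


@lru_cache(maxsize=None)
def depth(term):
    """Length of the longest path from the term up to the root."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


@lru_cache(maxsize=None)
def ancestors(term):
    """All terms reachable by following parent links."""
    result = set()
    for parent in PARENTS[term]:
        result |= {parent} | ancestors(parent)
    return frozenset(result)


def dcnt(term):
    """Number of terms that have this term as an ancestor."""
    return sum(1 for other in PARENTS if term in ancestors(other))


for t in PARENTS:
    print(f"{t:4} L{level(t):02d} D{depth(t):02d} dcnt={dcnt(t)}")
```

Term `d` ends up at level 2 but depth 3, which is the same effect that shifts the level counts in the table toward smaller numbers than the depth counts.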



# Gjoneska-Pfenning Data set
<img src="images/Gjoneska_Pfenning_2015_02.png" height="700px" width="100%">


# Neurotoxic stresses induce Ca2+ influx, causing aberrant activation of Cdk5
<img src="images/fnagi-06-00232-g001.jpg" height="400" width="600">
2014 | Castro-Alvarez JF, Uribe-Arias SA, Mejía-Raigosa D, Cardona-Gómez GP.
_Cyclin-dependent kinase 5, a node protein in diminished tauopathy: a systems biology approach_


Tau normally binds to and stabilizes microtubules.
In AD, tau detaches from microtubules and adheres to other tau molecules, forming tangles inside neurons.

# Alzheimer's disease inducible mouse model
### **Human AD brains show significant p25 increase**


### **Inducibly overexpress human p25 in mice**:
  * Human p25 gene placed under control of calcium/calmodulin-dependent protein kinase II (CK) promoter
  * Mice raised on doxycycline-supplemented diet until brains are developed
  * Mice given doxycycline-free diet
  * p25 is overexpressed

2006 | Cruz JC, Kim D, Moy LY, Dobbin MM, Sun X, Bronson RT, Tsai LH
_p25/Cyclin-Dependent Kinase 5 Induces Production and Intraneuronal Accumulation of Amyloid β In Vivo_

# AD mouse model induced
<img src="images/induced_mouse.png" height="200" width="400">

2015 | Gjoneska E, Pfenning AR, Mathys H, Quon G, Kundaje A, Tsai LH & Kellis M
_Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease_

# Gene Expression Examined
<img src="images/induced_hippocampus_col.png" height="600">

2015 | Gjoneska E, Pfenning AR, Mathys H, Quon G, Kundaje A, Tsai LH & Kellis M
_Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease_

# Grouping

# List of 12 GO Terms
<img src="images/az_ungrouped_bw.png" height="560">
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Some GO terms are related
![Unsorted Grouped GO Terms](images/az_ungrouped_rgb.png)
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Group related GO terms
![Group/Sorted GO terms](images/az_grouped_rgb.png)
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Group using a Sections File

# Flexible Grouping: Research can guide choices
<img src="images/fig2.png" height="700" width="700">
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Generate a new Sections File

**Running this**:
```
$ scripts/wr_sections.py goids.txt
hdr GOs(0 in 0 sections,  61 unused) WROTE: sections_in.txt
hdr GOs(0 in 0 sections,  61 unused) WROTE: sections.txt
usr GOs(0 in 0 sections, 840 ungrpd) WROTE: grouped_gos.txt
   840 user GO IDs

```

**Creates sections_in.txt (ASCII text file)**:
```
# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...
```


# Add User-defined Sections

**Original sections file:**
```
# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...
```

**User-defined sections:**
```
# SECTION: immune
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process

# SECTION: development
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...

# SECTION: stimulus
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus

# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
```
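
A sections file of this shape is simple to read programmatically. The sketch below is a minimal reader written for illustration: `read_sections` is a hypothetical helper rather than the GOATOOLS reader, and it assumes that only the leading GO ID on each line matters, with everything after it serving as a comment for human readers.

```python
SECTIONS_TXT = """\
# SECTION: immune
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process

# SECTION: development
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...

# SECTION: stimulus
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus

# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
"""


def read_sections(text):
    """Return {section name: [GO IDs]} in file order (hypothetical helper)."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# SECTION:"):
            current = line.split(":", 1)[1].strip()
            sections[current] = []
        elif line.startswith("GO:") and current is not None:
            # Assumption: text after the GO ID is a human-readable comment.
            sections[current].append(line.split()[0])
    return sections


sections = read_sections(SECTIONS_TXT)
for name, goids in sections.items():
    print(name, goids)
```

Because the file is plain text, regrouping terms only requires moving or adding lines under a different `# SECTION:` heading.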

<img src="images/fig2.png" height="700" width="700">
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Move _germinal center formation_
Move _germinal center formation_ from the **development** section to the **immune** section by adding _immune response_ (GO:0006955) to the **immune** section:
```
# SECTION: immune
GO:0002376  # BP **  1796 L01 D01 R01 K immune system process
GO:0006955  # immune response

# SECTION: development
GO:0032502  # BP **  6473 L01 D01 R01 E developmental process
GO:0048646  # BP **   878 L02 D02 R02 E anatomical structure ...

# SECTION: stimulus
GO:0050896  # BP **  6004 L01 D01 R01 F response to stimulus

# SECTION: Misc.
GO:0008150  # BP ** 29625 L00 D00 R00   biological_process
```

# Compare GOATOOLS GOEAs to Other Tools

* GO Terms returned: Specific vs Broad
* Genes Associated with GO Terms
* P-values across tools
* Genes returned across tools
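
For context on what is being compared: a GOEA asks, for each GO term, whether the study genes carry that term more often than expected from the population. One common way to score this is a one-sided Fisher's exact test, which equals the upper tail of the hypergeometric distribution. The sketch below is an illustration under that assumption, using only the standard library. The function name and the counts are made up for the example, and it is not necessarily the exact test that any of the compared tools runs.

```python
from math import comb


def enrichment_pvalue(study_count, study_n, pop_count, pop_n):
    """P(X >= study_count) for X ~ Hypergeometric(pop_n, pop_count, study_n)."""
    upper = min(study_n, pop_count)
    total = comb(pop_n, study_n)
    tail = sum(
        comb(pop_count, i) * comb(pop_n - pop_count, study_n - i)
        for i in range(study_count, upper + 1)
    )
    return tail / total


# Illustrative counts: 5 of 20 population genes carry the term,
# and 3 of the 5 study genes do.
p = enrichment_pvalue(study_count=3, study_n=5, pop_count=5, pop_n=20)
print(f"p = {p:.4f}")
```

Tools can differ in sidedness, in the population they assume, and in how they correct for multiple testing, so p-values for the same GO term need not match exactly across tools.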


# GO Terms enriched: Specific vs Broad
<img src="images/cmp1_specific_vs_broad.png" height="500" width="500">
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# P-values
<img src="images/cmp2_pvals.png" height="500" width="500">
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Genes Associated with GO Terms
<img src="images/cmp3_assc.png" height="650" width="650">
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Genes returned
![genes returned](images/suppfig4.png)
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# GOATOOLS found more statistically significant GO terms than found by DAVID 6.8 when using the same annotations
![GOATOOLS and DAVID](images/goatools_v_david6p8.png)

# Stochastic GOEA Simulations with GOATOOLS

# Remove 30 Broad GOs from 17,000 total
![sim_p0](images/sim_p0.png)

<img src="images/fig2.png" height="700" width="700">
2018 Klopfenstein DV, Zhang L, Pedersen BS, ... Tang H. Scientific Reports

# Turn Propagate Counts On
![sim_p1](images/sim_p1.png)
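
With propagate counts on, a gene annotated to a GO term is also counted for every ancestor of that term. The sketch below shows the idea on a toy DAG. The terms, genes, and helper functions are hypothetical, and this is an illustration of the concept rather than the GOATOOLS implementation.

```python
# Toy DAG (hypothetical terms): child -> set of direct parents.
PARENTS = {"root": set(), "a": {"root"}, "b": {"a"}, "c": {"a"}}

# Direct annotations (hypothetical genes): gene -> set of terms.
ASSOC = {"gene1": {"b"}, "gene2": {"c"}, "gene3": {"a"}}


def ancestors(term):
    """All terms reachable by following parent links."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


def propagate(assoc):
    """Add every ancestor of each annotated term to the gene's annotation set."""
    return {
        gene: terms | {anc for t in terms for anc in ancestors(t)}
        for gene, terms in assoc.items()
    }


def genes_per_term(assoc):
    """Count how many genes are annotated to each term."""
    counts = {}
    for terms in assoc.values():
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
    return counts


print("direct:    ", genes_per_term(ASSOC))
print("propagated:", genes_per_term(propagate(ASSOC)))
```

After propagation, term `a` is credited with all three genes instead of one, which is why broad terms gain counts when this option is turned on.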

# Citations
![Citations according to Google Scholar](images/citations_2019_01_04.png)

Stats downloaded Feb 2019

# Over 200 GitHub stars
![GitHub Stars](images/githubstars_2019_01_04.png)

<img src="images/goatools_stargazers.png" width="2500">


# GOATOOLS Users' Research
* **Firefly genomes**: Evolutionary origin of bioluminescence
  * Examines two firefly species that diverged over 100 million years ago
* **Cultured Pearl Quality**
* How light affects **cyanobacterium Synechocystis** growth
* **Zebrafish** database: ZFLNC
* **Maize**: Find causal genetic loci using HT phenotyping
* **Circadian Analysis**
* **Giant Clams** and their symbiont **dinoflagellates** respond to rising sea temperature
* **Yeast** to study Statin Drug Response

# Conclusion

The authors were able to show:
  - GOATOOLS is as good as, if not better than, other GOEA tools
  - GOEA sensitivity is high (genes found are correct)
      - Propagate counts ON
      - Study sizes of 20+ gene products
  - Novel GO grouping

# Possible Uses in Dampier Lab

- GOEAs
- GO terms can be features for gene products in machine learning applications
- GO grouping for feature extraction for better ML results


Feature extraction is a dimensionality reduction process in which an initial set of raw variables is reduced to more manageable groups (features) for processing, while still accurately and completely describing the original data set.
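
As a rough sketch of how this could look in practice, the code below builds one binary feature per GO term and then collapses those into one binary feature per section. The gene names and their annotations are hypothetical, the section contents mirror the user-defined sections shown earlier, and both helper functions are illustrative rather than part of GOATOOLS.

```python
# Hypothetical annotations: gene -> GO terms (illustrative only).
ASSOC = {
    "geneA": {"GO:0002376", "GO:0050896"},
    "geneB": {"GO:0032502", "GO:0048646"},
    "geneC": {"GO:0048646"},
}

# Section -> GO terms listed under it in a sections file.
SECTIONS = {
    "immune": {"GO:0002376"},
    "development": {"GO:0032502", "GO:0048646"},
    "stimulus": {"GO:0050896"},
}


def term_features(assoc):
    """One binary column per GO term."""
    terms = sorted(set().union(*assoc.values()))
    rows = {g: [int(t in goids) for t in terms] for g, goids in assoc.items()}
    return terms, rows


def section_features(assoc, sections):
    """One binary column per section: 1 if the gene has any GO term in it."""
    names = list(sections)
    rows = {
        g: [int(bool(goids & sections[n])) for n in names]
        for g, goids in assoc.items()
    }
    return names, rows


terms, by_term = term_features(ASSOC)
names, by_section = section_features(ASSOC, SECTIONS)
print(len(terms), "term features ->", len(names), "section features")
print(by_section)
```

Here four term-level columns reduce to three section-level columns; with thousands of GO terms and a few dozen sections the reduction would be far larger.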