# Querying the Gene Ontology Database

In this tutorial, we'll pose several questions that can be answered with the 
Gene Ontology (GO) database and build SQL queries to answer them. In each case, 
we'll build the final query step by step. When designing your own queries, you 
may want to work step by step as well, but your final work should contain only 
the final query.

## GO Database

The GO database contains information about biological terms (GO terms). GO terms 
describe biological processes, molecular functions, and cellular components 
(localization), for example 'nucleus', 'apoptosis', and 'positive regulation of 
inflammatory response'. The GO database also provides associations between genes 
and GO terms. It is one of the primary resources researchers use to look up the 
functions of genes, as described by GO terms.

The tables used in this tutorial include:
- `species` - organisms (genus and species names)
- `gene_product` - information about genes
- `term` - the GO terms
- `graph_path` - relationships between terms
- `association` - a join table linking terms to gene products

GO terms are organized in a *hierarchical structure*; for example, the term 
'nucleus' encompasses more specific terms such as 'nuclear chromosome' and 
'nuclear membrane'. The relationships between terms are stored in the 
`graph_path` table, which links pairs of terms. To find all genes associated 
with the term 'nucleus', you first collect all the terms encompassed by 
'nucleus' from the `graph_path` table, and then use the `association` table to 
join these terms with the genes annotated to them.
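The lookup described above can be sketched on a toy in-memory database. The tables below only mimic the columns this tutorial relies on; the rows and gene symbols are made up. The sketch also assumes that `term1_id` holds the broader term and that every term has a path to itself, which should be verified against the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE association (term_id INTEGER, gene_product_id INTEGER);

INSERT INTO term VALUES (1, 'nucleus'), (2, 'nuclear membrane'), (3, 'cytoplasm');
-- Assumed convention: term1_id is the broader term, term2_id a term under it,
-- and every term has a path to itself.
INSERT INTO graph_path VALUES (1, 1), (1, 2), (2, 2), (3, 3);
INSERT INTO gene_product VALUES (10, 'geneA'), (11, 'geneB'), (12, 'geneC');
INSERT INTO association VALUES (1, 10), (2, 11), (3, 12);
""")

# Genes annotated directly to 'nucleus' only
direct = conn.execute("""
    SELECT G.symbol
    FROM term AS T, association AS A, gene_product AS G
    WHERE T.name = 'nucleus'
      AND A.term_id = T.id
      AND G.id = A.gene_product_id
    ORDER BY G.symbol
""").fetchall()

# Genes annotated to 'nucleus' or to any term that graph_path places under it
via_hierarchy = conn.execute("""
    SELECT DISTINCT G.symbol
    FROM term AS T1, graph_path AS GP, association AS A, gene_product AS G
    WHERE T1.name = 'nucleus'
      AND GP.term1_id = T1.id
      AND A.term_id = GP.term2_id
      AND G.id = A.gene_product_id
    ORDER BY G.symbol
""").fetchall()

direct_symbols = [row[0] for row in direct]
hierarchy_symbols = [row[0] for row in via_hierarchy]
print(direct_symbols)     # ['geneA']
print(hierarchy_symbols)  # ['geneA', 'geneB']
conn.close()
```

In this toy data, going through `graph_path` also picks up `geneB`, which is annotated only to 'nuclear membrane'.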

The database schema can also be found at: http://www-legacy.geneontology.org/images/diag-godb-er.jpg

![GO database ER diagram](godb.er.png)

In [1]:
# Notebook setup
import os, sys, pathlib
sys.path.append(os.environ['BMESAHMETDIR'])
import pandas as pd
import sqlite3
import bmes
from urllib.request import urlretrieve


# Class to interact with the database
class SQLite():
    def __init__(self, file: pathlib.Path):
        assert file.exists(), f"File {file} does not exist"
        assert file.name.endswith('.sqlite') or file.name.endswith('.db'), \
            f"File {file} is not a SQLite database"
        
        self.file = file

    def __enter__(self):
        self.conn = sqlite3.connect(self.file)
        self.conn.row_factory = sqlite3.Row
        return self.conn.cursor()

    def __exit__(self, type, value, traceback):
        self.conn.commit()
        self.conn.close()

    def __repr__(self) -> str:
        return f"SQLite({self.file})"

    def __str__(self) -> str:
        return f"SQLite({self.file})"

    def connect(self):
        self.conn = sqlite3.connect(self.file)
        self.conn.row_factory = sqlite3.Row
        self.cursor = self.conn.cursor()
        return self.cursor

    def commit(self):
        self.conn.commit()

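    def execute(self, query):
        # Route only statements that start with SELECT to select(); a substring
        # test would also match statements such as "INSERT ... SELECT"
        if query.lstrip().lower().startswith('select'):
            return self.select(query)
        else:
            self.cursor.execute(query)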

    def disconnect(self):
        self.conn.close()

    def select(self, query):
        self.cursor.execute(query)
        rows = self.cursor.fetchall()

        if len(rows) == 0:
            print("No rows returned for query")
            return None
        else:
            df = pd.DataFrame(rows)
            df.columns = [col[0] for col in self.cursor.description]
            return df

In [2]:
# Download the database
url = "http://sacan.biomed.drexel.edu/ftp/binf/godb.sqlite"
DBPATH = pathlib.Path('/mnt/z/db/godb.db')
if not DBPATH.exists():
    urlretrieve(url, DBPATH)

# Set up database connection
godb = SQLite(DBPATH)
godb.connect();

In [3]:
# Let's look at the tables in the database
godb.execute("SELECT name FROM sqlite_master WHERE type='table';")

| | name |
|---|---|
| 0 | sqlite_sequence |
| 1 | species |
| 2 | biodbbuild |
| 3 | gene_product |
| 4 | association |
| 5 | term |
| 6 | graph_path |
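Once the table names are known, the columns of a table can be listed with SQLite's `PRAGMA table_info`. The sketch below runs against a hypothetical stand-in for the `species` table, reduced to the three columns used in this tutorial, so that it is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for one table; the real table may have more columns.
conn.execute(
    "CREATE TABLE species (id INTEGER PRIMARY KEY, genus TEXT, species TEXT)"
)

# Each row of PRAGMA table_info is (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(species)").fetchall()
column_names = [col[1] for col in columns]
print(column_names)
conn.close()
```

With the connection set up above, `godb.select("PRAGMA table_info(species)")` should return the same kind of listing as a DataFrame, since `select` passes the statement through unchanged.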


## Queries

### Retrieve the Names of Species that are Under the Genus 'Drosophila'

In [4]:
godb.select("""
SELECT genus, species FROM species
    WHERE genus = 'Drosophila'
    LIMIT 0, 10
""")

| | genus | species |
|---|---|---|
| 0 | Drosophila | sejuncta |
| 1 | Drosophila | poonia |
| 2 | Drosophila | guttifera |
| 3 | Drosophila | divaricata |
| 4 | Drosophila | inciliata |
| 5 | Drosophila | guayllabambae |
| 6 | Drosophila | bicornuta |
| 7 | Drosophila | cf. clefta BCW-2006 |
| 8 | Drosophila | pallidifrons |
| 9 | Drosophila | flavomontana |


### Retrieve the Genus and Species Name of the Organism whose Species Name has a Prefix 'mel' [LIKE Clause]

In [5]:
godb.select("""
SELECT genus, species 
    FROM species
    WHERE species LIKE 'mel%'
    LIMIT 0, 10
""")

| | genus | species |
|---|---|---|
| 0 | Alnicola | melinoides |
| 1 | Hyaloptila | melanosoma |
| 2 | Xylaria | mellissii |
| 3 | Trichaster | melanocephalus |
| 4 | Pomaria | melanosticta |
| 5 | Stomolophus | meleagris |
| 6 | Isoetes | melanospora |
| 7 | Gladiolus | meliusculus |
| 8 | Brucella | melitensis CNGB 1120 |
| 9 | Eleocharis | melanocarpa |
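The `LIKE` operator supports two wildcards: `%` matches any run of characters (including none) and `_` matches exactly one character. The sketch below tries a few patterns on a small made-up `species` table and passes each pattern as a bound parameter (`?`) rather than pasting it into the SQL text. The helper `species_like` is hypothetical and not part of the tutorial's `SQLite` class.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE species (genus TEXT, species TEXT)")
conn.executemany(
    "INSERT INTO species VALUES (?, ?)",
    [("Drosophila", "melanogaster"), ("Apis", "mellifera"),
     ("Homo", "sapiens"), ("Danio", "rerio")],
)

def species_like(pattern):
    """Return species names matching a LIKE pattern, sorted alphabetically."""
    rows = conn.execute(
        "SELECT species FROM species WHERE species LIKE ? ORDER BY species",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]

prefix = species_like("mel%")       # names starting with 'mel'
contains = species_like("%er%")     # names containing 'er' anywhere
five_chars = species_like("_____")  # names that are exactly five characters long

print(prefix)      # ['melanogaster', 'mellifera']
print(contains)    # ['melanogaster', 'mellifera', 'rerio']
print(five_chars)  # ['rerio']
conn.close()
```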


### Retrieve the Gene Symbols of all *Drosophila melanogaster* Genes that are Annotated to the Term 'nucleus'

* Gene symbols - Use `gene_product.symbol`
* *Drosophila melanogaster* - Use `species.genus` and `species.species`

Below, we'll progressively build the query, joining with more and more related 
tables until all the required tables are referenced. It is easier to begin with
the explicit query constraints and then add the implicit constraints.  
The last query below shows the final query and its result.

Here's the approach we will use to build the query:

- Find the GO terms under the term of interest:  
  `term.id -> graph_path.term1_id`, then read `graph_path.term2_id`
- Find the associated gene products:  
  `graph_path.term2_id -> association.term_id`, then `association.gene_product_id -> gene_product.id`
- Restrict to the species of interest and read the gene symbols:  
  `gene_product.species_id -> species.id`, then read `gene_product.symbol`

In [6]:
# Look at gene_product table
godb.select("""
SELECT * FROM gene_product
    WHERE full_name != ''
    LIMIT 0, 5
""")

| | id | symbol | dbxref_id | species_id | type_id | full_name |
|---|---|---|---|---|---|---|
| 0 | 190680 | DDB_G0267178_RTE | 291581 | 1333934 | 45235 | DIRS1 ORF2 fragment |
| 1 | 190681 | DDB_G0267182_RTE | 291583 | 1333934 | 45235 | DIRS1 ORF2 fragment |
| 2 | 190682 | DDB_G0267188_RTE | 291584 | 1333934 | 45235 | DIRS1 ORF2 fragment |
| 3 | 190683 | DDB_G0267206_RTE | 291585 | 1333934 | 45235 | DIRS1 ORF2 fragment |
| 4 | 190684 | DDB_G0267210_RTE | 291586 | 1333934 | 45235 | DIRS1 ORF2 fragment |


#### Get All *Drosophila melanogaster* Genes

In [7]:
# Join gene_product and species to get all gene products of
# Drosophila melanogaster
godb.select("""
SELECT gene_product.*
    FROM gene_product, species
    WHERE
        species.genus = 'Drosophila' AND
        species.species = 'melanogaster' AND
        gene_product.species_id = species.id
    LIMIT 0, 5
""")

| | id | symbol | dbxref_id | species_id | type_id | full_name |
|---|---|---|---|---|---|---|
| 0 | 203433 | 064Ya | 315234 | 882382 | 45231 | 064Ya |
| 1 | 203434 | 10-4 | 315237 | 882382 | 45231 | 10-4 |
| 2 | 203435 | 11 | 315239 | 882382 | 45231 | 11 |
| 3 | 203436 | 128up | 315241 | 882382 | 45236 | upstream of RpIII128 |
| 4 | 203437 | 14-3-3epsilon | 315246 | 882382 | 45236 | 14-3-3epsilon |


In [8]:
# Look at GO terms
godb.select("""
SELECT * FROM term LIMIT 1000, 5
""")

| | id | name | term_type | acc | is_obsolete | is_root | is_relation |
|---|---|---|---|---|---|---|---|
| 0 | 1001 | accumulation of oxidatively modified proteins ... | biological_process | GO:0001317 | 0 | 0 | 0 |
| 1 | 1002 | formation of oxidatively modified proteins inv... | biological_process | GO:0001318 | 0 | 0 | 0 |
| 2 | 1003 | inheritance of oxidatively modified proteins i... | biological_process | GO:0001319 | 0 | 0 | 0 |
| 3 | 1004 | age-dependent response to reactive oxygen spec... | biological_process | GO:0001320 | 0 | 0 | 0 |
| 4 | 1005 | age-dependent general metabolic decline involv... | biological_process | GO:0001321 | 0 | 0 | 0 |


**NOTE:** In SQLite, string comparisons with `=` are case-sensitive unless the 
comparison uses the `NOCASE` collation (for example via `COLLATE NOCASE`). 
Comparisons with the `LIKE` operator, in contrast, are case-insensitive by 
default for ASCII characters.
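A quick way to check this behavior is to compare two string literals directly, without touching any table. The sketch below uses a throwaway in-memory connection; SQLite reports the result of each comparison as `1` (true) or `0` (false).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
equals, equals_nocase, like = conn.execute("""
    SELECT 'Nucleus' = 'nucleus',                 -- default (binary) comparison
           'Nucleus' = 'nucleus' COLLATE NOCASE,  -- explicit NOCASE collation
           'Nucleus' LIKE 'nucleus'               -- LIKE ignores ASCII case
""").fetchone()
print(equals, equals_nocase, like)  # 0 1 1
conn.close()
```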

In [9]:
# Look at terms that contain the word 'nucleus'
godb.select("""
SELECT * FROM term
    WHERE name LIKE '%nucleus%'
    LIMIT 0, 5
""")

| | id | name | term_type | acc | is_obsolete | is_root | is_relation |
|---|---|---|---|---|---|---|---|
| 0 | 90 | ribosomal subunit export from nucleus | biological_process | GO:0000054 | 0 | 0 | 0 |
| 1 | 91 | ribosomal large subunit export from nucleus | biological_process | GO:0000055 | 0 | 0 | 0 |
| 2 | 92 | ribosomal small subunit export from nucleus | biological_process | GO:0000056 | 0 | 0 | 0 |
| 3 | 93 | protein import into nucleus, docking | biological_process | GO:0000059 | 0 | 0 | 0 |
| 4 | 94 | protein import into nucleus, translocation | biological_process | GO:0000060 | 0 | 0 | 0 |


In [10]:
# Look for the term 'nucleus'
godb.select("""
SELECT * FROM term
    WHERE name = 'nucleus'
    LIMIT 0, 5
""")

| | id | name | term_type | acc | is_obsolete | is_root | is_relation |
|---|---|---|---|---|---|---|---|
| 0 | 4673 | nucleus | cellular_component | GO:0005634 | 0 | 0 | 0 |


#### Get all associated GO terms

The query below retrieves the GO terms that the `graph_path` table places under
the term 'nucleus' (such as 'nuclear chromosome').

In [11]:
# Perform a join to get associated terms
# Here, we join the `term` table with itself to get the term information
# for all associated terms

godb.select("""
SELECT T2.*
    FROM term as T1, term as T2, graph_path as GP
    WHERE
        T1.name = 'nucleus' AND
        GP.term1_id = T1.id AND
        T2.id = GP.term2_id
    LIMIT 0, 5
""")

| | id | name | term_type | acc | is_obsolete | is_root | is_relation |
|---|---|---|---|---|---|---|---|
| 0 | 181 | nuclear ubiquitin ligase complex | cellular_component | GO:0000152 | 0 | 0 | 0 |
| 1 | 203 | nuclear exosome (RNase complex) | cellular_component | GO:0000176 | 0 | 0 | 0 |
| 2 | 260 | nuclear chromosome | cellular_component | GO:0000228 | 0 | 0 | 0 |
| 3 | 263 | condensed nuclear chromosome | cellular_component | GO:0000794 | 0 | 0 | 0 |
| 4 | 604 | condensed nuclear chromosome, centromeric region | cellular_component | GO:0000780 | 0 | 0 | 0 |


#### Link Associated Terms to Gene Product IDs

The query below pulls records showing the corresponding gene product ids for GO
terms associated with the 'nucleus' term.

In [12]:
# Pull records showing the associated gene product ids and term ids for
# terms associated with the nucleus

godb.select("""
SELECT A.id, A.term_id, A.gene_product_id
    FROM
        term as T1,
        graph_path as GP,
        term as T2,
        association as A
    WHERE
        T1.name = 'nucleus' AND
        GP.term1_id = T1.id AND
        T2.id = GP.term2_id AND
        A.term_id = T2.id
    LIMIT 0, 5
""")

| | id | term_id | gene_product_id |
|---|---|---|---|
| 0 | 631718 | 181 | 147756 |
| 1 | 983471 | 181 | 204446 |
| 2 | 987029 | 181 | 205622 |
| 3 | 1004874 | 181 | 210318 |
| 4 | 1004896 | 181 | 210319 |


#### Link GO Terms to the Gene Product Table

In [15]:
godb.select("""
SELECT GENE.*
    FROM
        term as T1,
        graph_path as GP,
        term as T2,
        association as A,
        gene_product as GENE
    WHERE
        T1.name = 'nucleus' AND
        GP.term1_id = T1.id AND
        T2.id = GP.term2_id AND
        A.term_id = T2.id AND
        GENE.id = A.gene_product_id
    LIMIT 0, 5
""")

| | id | symbol | dbxref_id | species_id | type_id | full_name |
|---|---|---|---|---|---|---|
| 0 | 147756 | NOT4 | 247071 | 95231 | 45231 | |
| 1 | 204446 | CG11261 | 318046 | 882382 | 45236 | |
| 2 | 205622 | CG15800 | 319580 | 882382 | 45236 | |
| 3 | 210318 | Cul1 | 318048 | 882382 | 45236 | Cullin 1 |
| 4 | 210319 | Cul2 | 326543 | 882382 | 45236 | Cullin 2 |


#### Look for Gene Products Specific to *Drosophila melanogaster*

In [16]:
godb.select("""
SELECT GENE.*
    FROM
        term as T1,
        graph_path as GP,
        term as T2,
        association as A,
        gene_product as GENE,
        species as S
    WHERE
        T1.name = 'nucleus' AND
        GP.term1_id = T1.id AND
        T2.id = GP.term2_id AND
        A.term_id = T2.id AND
        GENE.id = A.gene_product_id AND
        S.genus = 'Drosophila' AND
        S.species = 'melanogaster' AND
        GENE.species_id = S.id
    LIMIT 0, 5
""")

| | id | symbol | dbxref_id | species_id | type_id | full_name |
|---|---|---|---|---|---|---|
| 0 | 203437 | 14-3-3epsilon | 315246 | 882382 | 45236 | 14-3-3epsilon |
| 1 | 203437 | 14-3-3epsilon | 315246 | 882382 | 45236 | 14-3-3epsilon |
| 2 | 203437 | 14-3-3epsilon | 315246 | 882382 | 45236 | 14-3-3epsilon |
| 3 | 203438 | 14-3-3zeta | 315264 | 882382 | 45236 | 14-3-3zeta |
| 4 | 203438 | 14-3-3zeta | 315264 | 882382 | 45236 | 14-3-3zeta |
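One plausible reason for the repeated rows in this output is that a gene product can be linked to several terms under 'nucleus', or to the same term through more than one association record, and each link yields one joined row. The toy example below (made-up rows, with only the columns needed) reproduces the effect and shows how `DISTINCT` or `GROUP BY` collapses it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE association (
    id INTEGER PRIMARY KEY, term_id INTEGER, gene_product_id INTEGER
);
INSERT INTO gene_product VALUES (1, 'geneA'), (2, 'geneB');
-- geneA is annotated to three terms, geneB to one (made-up rows)
INSERT INTO association (term_id, gene_product_id)
    VALUES (100, 1), (101, 1), (102, 1), (100, 2);
""")

# Shared FROM/WHERE part of the three queries below
join_sql = """
    FROM association AS A, gene_product AS G
    WHERE G.id = A.gene_product_id
"""

# One row per association record: geneA appears three times
all_rows = [r[0] for r in conn.execute(
    "SELECT G.symbol" + join_sql + " ORDER BY G.symbol")]

# DISTINCT keeps each symbol once
unique_rows = [r[0] for r in conn.execute(
    "SELECT DISTINCT G.symbol" + join_sql + " ORDER BY G.symbol")]

# GROUP BY shows how many joined rows each symbol had
counts = conn.execute(
    "SELECT G.symbol, COUNT(*)" + join_sql
    + " GROUP BY G.symbol ORDER BY G.symbol").fetchall()

print(all_rows)     # ['geneA', 'geneA', 'geneA', 'geneB']
print(unique_rows)  # ['geneA', 'geneB']
print(counts)       # [('geneA', 3), ('geneB', 1)]
conn.close()
```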


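#### Final Query

The final query selects only the gene symbol and uses `DISTINCT` so that each
symbol is listed once.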

In [21]:
results = godb.select("""
SELECT DISTINCT GENE.symbol AS symbol
    FROM
        term as T1,
        graph_path as GP,
        term as T2,
        association as A,
        gene_product as GENE,
        species as S
    WHERE
        T1.name = 'nucleus' AND
        GP.term1_id = T1.id AND
        T2.id = GP.term2_id AND
        A.term_id = T2.id AND
        GENE.id = A.gene_product_id AND
        S.genus = 'Drosophila' AND
        S.species = 'melanogaster' AND
        GENE.species_id = S.id
""")

# Print the results
print("Number of genes in Drosophila melanogaster associated with the nucleus: ", 
      results.shape[0])
print("The first 10 of these genes are listed below:")
print(', '.join(results['symbol'].head(10)))

Number of genes in Drosophila melanogaster associated with the nucleus:  2377
The first 10 of these genes are listed below:
14-3-3epsilon, 14-3-3zeta, 2.1, 33-13, 4E-T, ADD1, AGO1, AGO2, AMPKalpha, AP-1mu
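The final query lists all tables in `FROM` and puts both the join conditions and the filters in `WHERE`. The same query can be written with explicit `JOIN ... ON` clauses, which keeps each join condition next to the table it introduces. The sketch below compares the two forms on a toy in-memory database with made-up rows and only the columns the query needs; it is an illustration of the syntax and was not run against the real GO database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (id INTEGER PRIMARY KEY, genus TEXT, species TEXT);
CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT, species_id INTEGER);
CREATE TABLE association (term_id INTEGER, gene_product_id INTEGER);

INSERT INTO species VALUES (1, 'Drosophila', 'melanogaster'), (2, 'Homo', 'sapiens');
INSERT INTO term VALUES (1, 'nucleus'), (2, 'nuclear chromosome');
INSERT INTO graph_path VALUES (1, 1), (1, 2), (2, 2);
INSERT INTO gene_product VALUES
    (10, 'flyGene1', 1), (11, 'flyGene2', 1), (12, 'humanGene', 2);
INSERT INTO association VALUES (1, 10), (2, 10), (2, 11), (1, 12);
""")

# Form used in this tutorial: comma-separated tables, conditions in WHERE
implicit_sql = """
SELECT DISTINCT GENE.symbol
    FROM term AS T1, graph_path AS GP, term AS T2,
         association AS A, gene_product AS GENE, species AS S
    WHERE
        T1.name = 'nucleus' AND
        GP.term1_id = T1.id AND
        T2.id = GP.term2_id AND
        A.term_id = T2.id AND
        GENE.id = A.gene_product_id AND
        S.genus = 'Drosophila' AND
        S.species = 'melanogaster' AND
        GENE.species_id = S.id
    ORDER BY GENE.symbol
"""

# Same joins written explicitly; only the filters remain in WHERE
explicit_sql = """
SELECT DISTINCT GENE.symbol
    FROM term AS T1
    JOIN graph_path AS GP ON GP.term1_id = T1.id
    JOIN term AS T2 ON T2.id = GP.term2_id
    JOIN association AS A ON A.term_id = T2.id
    JOIN gene_product AS GENE ON GENE.id = A.gene_product_id
    JOIN species AS S ON S.id = GENE.species_id
    WHERE
        T1.name = 'nucleus' AND
        S.genus = 'Drosophila' AND
        S.species = 'melanogaster'
    ORDER BY GENE.symbol
"""

implicit_result = [row[0] for row in conn.execute(implicit_sql)]
explicit_result = [row[0] for row in conn.execute(explicit_sql)]
print(implicit_result)  # ['flyGene1', 'flyGene2']
print(explicit_result)  # ['flyGene1', 'flyGene2']
conn.close()
```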
