## Querying the Gene Ontology Database

In this tutorial, I work through several questions about the Gene Ontology (GO) database and build SQL queries to answer them. In each case, I build the final query step by step. When you design your own queries, I recommend that you also work step by step; your final work, however, should contain only the final query that you arrive at.

The Gene Ontology database contains information about biological terms (GO terms). GO terms cover biological processes, molecular functions, and cellular components, for example "nucleus", "apoptosis", and "positive regulation of inflammatory response". The GO database also provides associations between genes and GO terms. It is one of the primary resources that researchers use to look up the function of genes, as described by GO terms.

The tables used in this tutorial are species, gene_product (information about genes), term (where GO terms are listed), graph_path (where relationships between terms are listed), and association (which acts as a join table between term and gene_product).

GO terms are organized in a hierarchical structure; for example, the term "nucleus" encompasses other terms such as "nuclear chromosome" and "nuclear membrane". These relationships are stored in the graph_path table, which links pairs of terms. To find all genes associated with the term "nucleus", you first collect all the terms encompassed under "nucleus" from the graph_path table, and then use the association table to join these terms with the genes that are associated with them.
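As a minimal sketch of this idea, the following self-contained example builds a tiny in-memory SQLite database and compares the genes annotated directly to "nucleus" with the genes annotated to "nucleus" or to any term under it. The table and column names follow this tutorial, but every row is invented for illustration (including the gene symbols and the distance-0 rows that link each term to itself), and the toy tables carry only the columns the example needs.

```python
import sqlite3

# Toy database: names follow the tutorial, all rows are made up.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER, distance INTEGER);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE association (id INTEGER PRIMARY KEY, term_id INTEGER, gene_product_id INTEGER);

INSERT INTO term VALUES (1, 'nucleus'), (2, 'nuclear chromosome'), (3, 'nuclear membrane');
-- term1_id is the encompassing term, term2_id the encompassed term.
INSERT INTO graph_path VALUES (1, 1, 0), (1, 2, 1), (1, 3, 1), (2, 2, 0), (3, 3, 0);
INSERT INTO gene_product VALUES (10, 'geneA'), (11, 'geneB'), (12, 'geneC');
INSERT INTO association VALUES (100, 1, 10), (101, 2, 11), (102, 3, 12);
""")

# Genes annotated directly to 'nucleus' only.
direct = [row[0] for row in demo.execute('''
    SELECT gene_product.symbol
    FROM term, association, gene_product
    WHERE term.name = 'nucleus'
    AND term.id = association.term_id
    AND association.gene_product_id = gene_product.id
    ORDER BY gene_product.symbol''')]

# Genes annotated to 'nucleus' or to any term that graph_path places under it.
with_descendants = [row[0] for row in demo.execute('''
    SELECT DISTINCT gene_product.symbol
    FROM term AS t1, graph_path, association, gene_product
    WHERE t1.name = 'nucleus'
    AND t1.id = graph_path.term1_id
    AND graph_path.term2_id = association.term_id
    AND association.gene_product_id = gene_product.id
    ORDER BY gene_product.symbol''')]

print(direct)            # ['geneA']
print(with_descendants)  # ['geneA', 'geneB', 'geneC']
```

In this toy data, going through graph_path picks up geneB and geneC, which would be missed by looking only at direct annotations.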


The database schema can also be found at: http://www-legacy.geneontology.org/images/diag-godb-er.jpg

![GO database ER diagram](https://sacan.biomed.drexel.edu/lib/exe/fetch.php?media=course:bcomp2:db:godb.er.png)

In [2]:
import os
import sys
import sqlite3

from pandas import DataFrame

# bmes is a helper module; its directory is read from the BMESAHMETDIR environment variable.
sys.path.append(os.environ['BMESAHMETDIR'])
import bmes

def myselect(sql):
    # Run the query on the global cursor `cur` (created in the next cell) and show the result as a table.
    cur.execute(sql)
    rows = cur.fetchall()
    if len(rows) == 0:
        print('No results returned for SQL query.')
    else:
        df = DataFrame(rows)
        df.columns = [x[0] for x in cur.description]
        display(df)  # display() is IPython-specific. In a non-IPython script, use print(df).

In [3]:
godbfile = bmes.downloadurl('http://sacan.biomed.drexel.edu/ftp/binf/godb.sqlite')
conn = sqlite3.connect(godbfile)
cur = conn.cursor()
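The myselect helper above relies on a global cursor named cur. As a hedged alternative, here is a sketch of a variant that receives the connection as an argument and returns the column names and rows instead of displaying them. The name select_rows and the toy species table are hypothetical and exist only for this illustration; the example uses only the standard sqlite3 module.

```python
import sqlite3

def select_rows(conn, sql, params=()):
    """Run a query on the given connection and return (column_names, rows)."""
    cursor = conn.cursor()
    cursor.execute(sql, params)
    columns = [d[0] for d in cursor.description]
    return columns, cursor.fetchall()

# Toy in-memory table so the sketch runs without the GO database file.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE species (id INTEGER, genus TEXT, species TEXT)")
demo.execute("INSERT INTO species VALUES (1, 'Drosophila', 'melanogaster')")

# The ? placeholder lets sqlite3 bind the value instead of pasting it into the SQL text.
columns, rows = select_rows(
    demo, "SELECT genus, species FROM species WHERE genus = ?", ("Drosophila",))
print(columns)  # ['genus', 'species']
print(rows)     # [('Drosophila', 'melanogaster')]
```

Passing the connection explicitly makes the helper independent of the order in which notebook cells are run.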


### Retrieve the names of the species that are under the genus ‘Drosophila’

In [4]:
myselect('''SELECT *
 FROM species
 WHERE genus='Drosophila'
 LIMIT 0,10''')

| | id | ncbi_taxa_id | common_name | lineage_string | genus | species | parent_id | left_value | right_value | taxonomic_rank |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1107 | 48328 | | | Drosophila | sejuncta | 1083217 | 1202877 | 1202878 | |
| 1 | 1645 | 937309 | | | Drosophila | poonia | 621428 | 1202658 | 1202659 | |
| 2 | 2828 | 66368 | | | Drosophila | guttifera | 78683 | 1201746 | 1201747 | |
| 3 | 5255 | 48369 | | | Drosophila | divaricata | 127578 | 1202793 | 1202794 | |
| 4 | 5841 | 937288 | | | Drosophila | inciliata | 997796 | 1202499 | 1202500 | |
| 5 | 5921 | 242880 | | | Drosophila | guayllabambae | 273046 | 1201789 | 1201790 | |
| 6 | 6762 | 112146 | | | Drosophila | bicornuta | 199226 | 1203161 | 1203162 | |
| 7 | 8847 | 381459 | | | Drosophila | cf. clefta BCW-2006 | 1468585 | 1203535 | 1203536 | |
| 8 | 11490 | 88884 | | | Drosophila | pallidifrons | 8889 | 1202187 | 1202188 | |
| 9 | 18730 | 40367 | | | Drosophila | flavomontana | 1369357 | 1202374 | 1202375 | |


### Retrieve the genus and species name of the organisms whose species name has the prefix ‘mel’ (use the LIKE operator)

In [5]:
myselect('''SELECT genus, species
    FROM species
    WHERE species LIKE 'mel%'
    LIMIT 0,10''')

| | genus | species |
|---|---|---|
| 0 | Alnicola | melinoides |
| 1 | Hyaloptila | melanosoma |
| 2 | Xylaria | mellissii |
| 3 | Trichaster | melanocephalus |
| 4 | Pomaria | melanosticta |
| 5 | Stomolophus | meleagris |
| 6 | Isoetes | melanospora |
| 7 | Gladiolus | meliusculus |
| 8 | Brucella | melitensis CNGB 1120 |
| 9 | Eleocharis | melanocarpa |
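To illustrate how LIKE patterns behave, here is a small sketch on an invented in-memory table (the rows are not taken from the GO database). In a pattern, % matches any run of characters, so 'mel%' matches values that start with "mel" and '%mel%' matches values that contain "mel" anywhere. With SQLite's default settings, LIKE also ignores case for ASCII letters, which is why the capitalized toy value is matched as well.

```python
import sqlite3

# Toy table; every row is made up for illustration.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE species (genus TEXT, species TEXT)")
demo.executemany("INSERT INTO species VALUES (?, ?)", [
    ("Drosophila", "melanogaster"),
    ("Apis", "mellifera"),
    ("Homo", "sapiens"),
    ("Bombyx", "Melanic form"),   # invented value with a capital M
    ("Camelus", "camelus"),       # contains 'mel' but does not start with it
])

def matching_species(pattern):
    """Return the species names that match a LIKE pattern, sorted by name."""
    sql = "SELECT species FROM species WHERE species LIKE ? ORDER BY species"
    return [row[0] for row in demo.execute(sql, (pattern,))]

prefix = matching_species("mel%")     # starts with 'mel' (case-insensitive)
anywhere = matching_species("%mel%")  # contains 'mel' anywhere

print(prefix)    # ['Melanic form', 'melanogaster', 'mellifera']
print(anywhere)  # ['Melanic form', 'camelus', 'melanogaster', 'mellifera']
```

The capitalized value sorts first because the default ORDER BY comparison is case-sensitive, even though LIKE is not.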


### Retrieve the gene symbols (gene_product.symbol) of all Drosophila melanogaster (species.genus, species.species) genes that are annotated to ‘nucleus’

Below, I progressively build the query, joining with more and more related tables until all the required tables are referenced. The last query below shows the final query and its result.
* Join path: term.id -> graph_path.term1_id /term2_id -> association.term_id /gene_product_id -> gene_product.id
* Join path: gene_product.species_id -> species.id
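The two join paths above can be written either with comma-separated tables and WHERE conditions or with explicit JOIN ... ON clauses. The sketch below runs both forms on a toy in-memory database and checks that they return the same symbols. The table and column names follow this tutorial, but all rows are invented and the toy tables contain only the columns needed here; the ? placeholders are standard sqlite3 parameter binding.

```python
import sqlite3

# Toy database: names follow the tutorial, all rows are made up.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE species (id INTEGER PRIMARY KEY, genus TEXT, species TEXT);
CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT, species_id INTEGER);
CREATE TABLE association (id INTEGER PRIMARY KEY, term_id INTEGER, gene_product_id INTEGER);

INSERT INTO species VALUES (1, 'Drosophila', 'melanogaster'), (2, 'Homo', 'sapiens');
INSERT INTO term VALUES (1, 'nucleus'), (2, 'nuclear chromosome');
INSERT INTO graph_path VALUES (1, 1), (1, 2), (2, 2);
INSERT INTO gene_product VALUES (10, 'geneA', 1), (11, 'geneB', 1), (12, 'geneC', 2);
INSERT INTO association VALUES (100, 1, 10), (101, 2, 11), (102, 1, 11), (103, 2, 12);
""")

params = ("nucleus", "Drosophila", "melanogaster")

# Comma-separated tables; the join conditions live in the WHERE clause.
implicit = [row[0] for row in demo.execute('''
    SELECT DISTINCT gene_product.symbol
    FROM term AS t1, graph_path, association, gene_product, species
    WHERE t1.name = ?
    AND t1.id = graph_path.term1_id
    AND graph_path.term2_id = association.term_id
    AND association.gene_product_id = gene_product.id
    AND gene_product.species_id = species.id
    AND species.genus = ? AND species.species = ?
    ORDER BY gene_product.symbol''', params)]

# Explicit JOIN ... ON; each join condition sits next to the table it brings in.
explicit = [row[0] for row in demo.execute('''
    SELECT DISTINCT gene_product.symbol
    FROM term AS t1
    JOIN graph_path ON graph_path.term1_id = t1.id
    JOIN association ON association.term_id = graph_path.term2_id
    JOIN gene_product ON gene_product.id = association.gene_product_id
    JOIN species ON species.id = gene_product.species_id
    WHERE t1.name = ?
    AND species.genus = ? AND species.species = ?
    ORDER BY gene_product.symbol''', params)]

print(implicit)  # ['geneA', 'geneB']
print(explicit)  # ['geneA', 'geneB']
```

Both forms describe the same joins; the explicit form mainly makes it easier to see which condition connects which pair of tables.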

In [11]:
myselect('''SELECT *
    FROM gene_product
    WHERE full_name != ''
    LIMIT 0,5
    ''')

| | id | symbol | dbxref_id | species_id | type_id | full_name |
|---|---|---|---|---|---|---|
| 0 | 190680 | DDB_G0267178_RTE | 291581 | 1333934 | 45235 | DIRS1 ORF2 fragment |
| 1 | 190681 | DDB_G0267182_RTE | 291583 | 1333934 | 45235 | DIRS1 ORF2 fragment |
| 2 | 190682 | DDB_G0267188_RTE | 291584 | 1333934 | 45235 | DIRS1 ORF2 fragment |
| 3 | 190683 | DDB_G0267206_RTE | 291585 | 1333934 | 45235 | DIRS1 ORF2 fragment |
| 4 | 190684 | DDB_G0267210_RTE | 291586 | 1333934 | 45235 | DIRS1 ORF2 fragment |


In [11]:
myselect('''SELECT gene_product.*
    FROM gene_product, species
    WHERE species.genus = 'Drosophila' AND species.species = 'melanogaster'
    AND species.id = gene_product.species_id
    LIMIT 0,5''')

| | id | symbol | dbxref_id | species_id | type_id | full_name |
|---|---|---|---|---|---|---|
| 0 | 203433 | 064Ya | 315234 | 882382 | 45231 | 064Ya |
| 1 | 203434 | 10-4 | 315237 | 882382 | 45231 | 10-4 |
| 2 | 203435 | 11 | 315239 | 882382 | 45231 | 11 |
| 3 | 203436 | 128up | 315241 | 882382 | 45236 | upstream of RpIII128 |
| 4 | 203437 | 14-3-3epsilon | 315246 | 882382 | 45236 | 14-3-3epsilon |


In [13]:
myselect('''SELECT *
    FROM term
    LIMIT 0,5''')

| | id | name | term_type | acc | is_obsolete | is_root | is_relation |
|---|---|---|---|---|---|---|---|
| 0 | 1 | is_a | relationship | is_a | 0 | 0 | 1 |
| 1 | 2 | consider | metadata | consider | 0 | 0 | 1 |
| 2 | 3 | replaced_by | metadata | replaced_by | 0 | 0 | 1 |
| 3 | 4 | Grouping classes that can be excluded | subset | goantislim_grouping | 0 | 0 | 0 |
| 4 | 5 | Term not to be used for direct annotation | subset | gocheck_do_not_annotate | 0 | 0 | 0 |


In [14]:
myselect('''SELECT *
    FROM term
    WHERE name = 'nucleus'
    LIMIT 0,5''')

| | id | name | term_type | acc | is_obsolete | is_root | is_relation |
|---|---|---|---|---|---|---|---|
| 0 | 4673 | nucleus | cellular_component | GO:0005634 | 0 | 0 | 0 |


In [16]:
myselect('''SELECT t2.*
    FROM term AS t1, graph_path, term AS t2
    WHERE t1.name = 'nucleus'
    AND t1.id = graph_path.term1_id
    AND graph_path.term2_id = t2.id
    LIMIT 0,5''')

| | id | name | term_type | acc | is_obsolete | is_root | is_relation |
|---|---|---|---|---|---|---|---|
| 0 | 181 | nuclear ubiquitin ligase complex | cellular_component | GO:0000152 | 0 | 0 | 0 |
| 1 | 203 | nuclear exosome (RNase complex) | cellular_component | GO:0000176 | 0 | 0 | 0 |
| 2 | 260 | nuclear chromosome | cellular_component | GO:0000228 | 0 | 0 | 0 |
| 3 | 263 | condensed nuclear chromosome | cellular_component | GO:0000794 | 0 | 0 | 0 |
| 4 | 604 | condensed nuclear chromosome, centromeric region | cellular_component | GO:0000780 | 0 | 0 | 0 |


In [17]:
myselect('''SELECT association.id, association.term_id, association.gene_product_id
    FROM term AS t1, graph_path, term AS t2, association
    WHERE t1.name = 'nucleus'
    AND t1.id = graph_path.term1_id
    AND graph_path.term2_id = t2.id
    AND t2.id = association.term_id
    LIMIT 0,5''')

| | id | term_id | gene_product_id |
|---|---|---|---|
| 0 | 631718 | 181 | 147756 |
| 1 | 983471 | 181 | 204446 |
| 2 | 987029 | 181 | 205622 |
| 3 | 1004874 | 181 | 210318 |
| 4 | 1004896 | 181 | 210319 |


In [18]:
myselect('''SELECT gene_product.*
    FROM term AS t1, graph_path, term AS t2, association, gene_product
    WHERE t1.name = 'nucleus'
    AND t1.id = graph_path.term1_id
    AND graph_path.term2_id = t2.id
    AND t2.id = association.term_id
    AND association.gene_product_id = gene_product.id
    LIMIT 0,5''')

| | id | symbol | dbxref_id | species_id | type_id | full_name |
|---|---|---|---|---|---|---|
| 0 | 147756 | NOT4 | 247071 | 95231 | 45231 | |
| 1 | 204446 | CG11261 | 318046 | 882382 | 45236 | |
| 2 | 205622 | CG15800 | 319580 | 882382 | 45236 | |
| 3 | 210318 | Cul1 | 318048 | 882382 | 45236 | Cullin 1 |
| 4 | 210319 | Cul2 | 326543 | 882382 | 45236 | Cullin 2 |


In [20]:
myselect('''SELECT gene_product.*
    FROM term AS t1, graph_path, term AS t2, association, gene_product, species
    WHERE t1.name = 'nucleus'
    AND t1.id = graph_path.term1_id
    AND graph_path.term2_id = t2.id
    AND t2.id = association.term_id
    AND association.gene_product_id = gene_product.id
    AND species.genus = 'Drosophila' AND species.species = 'melanogaster'
    AND species.id = gene_product.species_id
    LIMIT 0,5''')

| | id | symbol | dbxref_id | species_id | type_id | full_name |
|---|---|---|---|---|---|---|
| 0 | 203437 | 14-3-3epsilon | 315246 | 882382 | 45236 | 14-3-3epsilon |
| 1 | 203437 | 14-3-3epsilon | 315246 | 882382 | 45236 | 14-3-3epsilon |
| 2 | 203437 | 14-3-3epsilon | 315246 | 882382 | 45236 | 14-3-3epsilon |
| 3 | 203438 | 14-3-3zeta | 315264 | 882382 | 45236 | 14-3-3zeta |
| 4 | 203438 | 14-3-3zeta | 315264 | 882382 | 45236 | 14-3-3zeta |


In [28]:
# This is the final form of the query to answer the question above:
cur.execute('''SELECT DISTINCT gene_product.symbol
    FROM term AS t1, graph_path, term AS t2, association, gene_product, species
    WHERE t1.name = 'nucleus'
    AND t1.id = graph_path.term1_id
    AND graph_path.term2_id = t2.id
    AND t2.id = association.term_id
    AND association.gene_product_id = gene_product.id
    AND species.genus = 'Drosophila' AND species.species = 'melanogaster'
    AND species.id = gene_product.species_id''')

rs = cur.fetchall()
print('Number of genes found: [%d]. The first 10 of these genes are listed below:' % len(rs))
print(', '.join([x[0] for x in rs[0:10]]))

Number of genes found: [2377]. The first 10 of these genes are listed below:
14-3-3epsilon, 14-3-3zeta, 2.1, 33-13, 4E-T, ADD1, AGO1, AGO2, AMPKalpha, AP-1mu
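The repeated rows in the intermediate result (for example, 14-3-3epsilon appearing three times) are what DISTINCT removes: the same gene can be reached through more than one association row. As a hedged sketch on invented toy data, the counting can also be done inside SQL with COUNT(DISTINCT ...) instead of fetching every row into Python. The table and column names follow this tutorial; the rows are made up.

```python
import sqlite3

# Toy database: geneA is annotated both to 'nucleus' and to a term under it,
# so the join reaches it through two different association rows.
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE association (id INTEGER PRIMARY KEY, term_id INTEGER, gene_product_id INTEGER);

INSERT INTO term VALUES (1, 'nucleus'), (2, 'nuclear chromosome');
INSERT INTO graph_path VALUES (1, 1), (1, 2);
INSERT INTO gene_product VALUES (10, 'geneA'), (11, 'geneB');
INSERT INTO association VALUES (100, 1, 10), (101, 2, 10), (102, 2, 11);
""")

# COUNT(*) counts joined rows; COUNT(DISTINCT ...) counts unique gene symbols.
n_rows, n_genes = demo.execute('''
    SELECT COUNT(*), COUNT(DISTINCT gene_product.symbol)
    FROM term AS t1, graph_path, association, gene_product
    WHERE t1.name = 'nucleus'
    AND t1.id = graph_path.term1_id
    AND graph_path.term2_id = association.term_id
    AND association.gene_product_id = gene_product.id''').fetchone()

print(n_rows, n_genes)  # 3 2
```

Here the join produces three rows but only two distinct genes, which mirrors why the final query above needs DISTINCT.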
