# Genes to Phenotypes and Phenotypes to Genes

This notebook allows you to query the gene and phenotype annotations in the IDR. 

**1. Gene to Phenotype queries**
You can query with a gene list or GO ontology ID for a protein complex/cellular component which is then queried for the associated gene members and get back phenotypes associated with those genes. 


**2. Phenotype to Gene queries**
You can query with a phenotype name and get back a list of all the genes linked 
to that phenotype.  You can then perform a GO enrichment analysis to see if the genes
are signficantly associated with any GO terms.




### First set up your environment and import all the packages you will need

In [1]:
# import pandas for data manipulation
import pandas

# import some functions for displaying the results
from ipywidgets import widgets, interact, fixed
from functools import wraps
from IPython.display import display, HTML

# import functions from the idr package to perform the queries
from idr import connection, create_http_session
from idr import genes_of_interest_go
from idr.widgets import textbox_widget, progress
from idr.widgets import dropdown_widget
from idr import get_phenotypes_for_genelist, get_similar_genes
from idr import attributes_by_attributes, get_organism_screenids

# import a package for doing GO term enrichement analysis
import gseapy as gp

### Create a session for querying the IDR

In [99]:
session = create_http_session()

### Set some variables to say that are going to only look at human genes

In [100]:
organism = 'Homo sapiens'
tax_id = '9606'


<br>
## Genes to Phenotypes query

### Create boxes to enter the Gene Ontology ID or manual gene list

In [30]:
go_term = textbox_widget('', 'Enter GO Id e.g. GO:0005885', 'Gene Ontology Id:', True)
manual_gene_list = textbox_widget('','Comma seperated gene symbols', 'Manual Gene List:', True)

# ENTER VALUES IN ONE OR BOTH BOXES THEN MOVE TO THE NEXT CELL

A Jupyter Widget

A Jupyter Widget

### Query GO for the genes associated with the GO ID, combine with any manually entered gene list

In [32]:
go_gene_list = []
if go_term.value.split(",") != ['']:
    go_gene_list = genes_of_interest_go(go_term.value, tax_id)
else:
    print 'Please enter a valid Gene Ontology Id'
manual_list = manual_gene_list.value.split(",")
if manual_list != ['']:
    go_gene_list = list(set(go_gene_list + manual_list))
    
print "Query list of genes:", go_gene_list

Please enter a valid Gene Ontology Id
Query list of genes: [u' HELZ2', u'ASH2L']


### Query the IDR to get all the phenotypes associated with the list of genes

In [34]:
# display the results in a table
# set the columns so that all values will be shown
old_width = pandas.get_option('display.max_colwidth')
pandas.set_option('display.max_colwidth', -1)

# then display the table
[query_genes_dataframe, screen_to_phenotype_dictionary] = get_phenotypes_for_genelist(session, go_gene_list, organism)
display(HTML(query_genes_dataframe.to_html( escape=False)))



Unnamed: 0,Entrez,Ensembl,Key,Value,PhenotypeName,PhenotypeAccession,ScreenIds
HELZ2,[85441],[],EntrezID,85441,"[elongated cell phenotype, cell with projections]","[CMPO_0000077, CMPO_0000071]","[1202, 1202]"
ASH2L,[9070],[ENSG00000129691],GeneName,ASH2L,[elongated cell phenotype],[CMPO_0000077],[1202]


---
## Phenotypes to genes query

### Choose a phenotype that we'd like to query
Enter a Phenotype Term Name E.g. 'elongated cell phenotype'


In [101]:
phenotype = 'elongated cell phenotype'

### We have to first write a function to query the IDR for the genes associated with a phenotype

In [102]:
def get_genes_for_phenotype(phenotype, conn, sid): 
    args = {
            "name": "Phenotype Term Name",
            "value": phenotype,
            "ns": "openmicroscopy.org/mapr/phenotype",
            "ns2": "openmicroscopy.org/mapr/gene",
            "s_id": sid
        }

    cc = attributes_by_attributes(conn, **args)
    dataframe = pandas.DataFrame.from_dict(cc)
       
    return dataframe



### Then we set up the connection to the IDR

In [103]:
conn = connection(host='idr.openmicroscopy.org', user='public', password='public', port=4064)
idr_base_url='https://idr.openmicroscopy.org'

Connected to IDR...


### Then we get a list of all the screens with human data

In [104]:
sid_list = get_organism_screenids(session, organism, idr_base_url)

print ("The IDR identifiers for the screens with human genes are:", sid_list)

('The IDR identifiers for the screens with human genes are:', ['102', '253', '206', '251', '803', '1351', '1202', '1101', '1302', '1251', '1151', '1203', '1204', '1851', '1651', '1652', '1653', '1654', '1801', '1751', '1901', '1952'])


### Go through each screen and find the genes associated with the selected phenotype

In [105]:
for i, sid in enumerate(set(sid_list)):
    df = get_genes_for_phenotype(phenotype, conn, sid)
    progress(i+1, len(set(sid_list)), status='Iterating through screens')
    if df.empty:
        continue

    gene_list = []
    for x in df.iloc[:, 0]:

        key = x[0]
        value = x[1]

        if key == "Gene Symbol":
            genesym = value
            gene_list.append(genesym)
    

print gene_list

['ALS2', 'ARAP2', 'ARFGEF3', 'ARHGAP10', 'ARHGAP11A', 'ARHGAP17', 'ARHGAP18', 'ARHGAP20', 'ARHGAP27', 'ARHGAP30', 'ARHGAP31', 'ARHGAP33', 'ARHGAP39', 'ARHGAP4', 'ARHGEF11', 'ARHGEF12', 'ARHGEF15', 'ARHGEF16', 'ARHGEF2', 'ARHGEF25', 'ARHGEF28', 'ARHGEF38', 'ARHGEF5', 'ARHGEF7', 'ARHGEF9', 'CDC42', 'CHN2', 'DEF6', 'DOCK1', 'DOCK4', 'DOCK5', 'DOCK9', 'FAM13A', 'FARP2', 'FGD2', 'FGD3', 'FGD6', 'GMIP', 'HMHA1', 'IGDCC4', 'ITSN1', 'MCF2L2', 'NET1', 'OCRL', 'OPHN1', 'PLEKHG1', 'PLEKHG2', 'PLEKHG6', 'PLEKHG7', 'PREX2', 'RAC1', 'RAC2', 'RALBP1', 'RHOA', 'RHOC', 'RHOV', 'RND2', 'RND3', 'SRGAP3', 'SWAP70', 'SYDE2', 'TAGAP', 'VAV1', 'VAV3']


### Now we can perform enrichment analysis on the the gene set associated with the phenotype 
It will find which terms are over-represented using annotations for that gene set.
We are going to use the python package called GSEApy to do this.  It performs gene set enrichment analysis by calling the enrichr online api. See  http://pythonhosted.org/gseapy/run.html for more information about the package, and http://amp.pharm.mssm.edu/Enrichr/ for more information about Enrichr.

### Set which database to query 
You can pick on the list shown on this web page http://amp.pharm.mssm.edu/Enrichr/#stats

Good ones to try are: <br>
* GO_Cellular_Component_2017b
* GO_Biological_Process_2017b
* GO_Molecular_Function_2017b
* Reactome_2016

In [107]:
databaseToQuery = 'GO_Cellular_Component_2017b'

### Create a function to do the enrichment analysis and then print out the top 5 enriched categories

In [108]:
def getenrichedterms(df, cutoff=0.05):
    enrichedlist = []
    for GOterm in df.iterrows():
        pvalue = df.loc[int(GOterm[0]),'Adjusted P-value']
        if pvalue <= cutoff:
            term = df.loc[GOterm[0],'Term']
            start1 = term.index('(')+1
            end1 = term.index(')')
            enrichedlist.append(term[start1:end1])
    return enrichedlist

    
print '\033[1m' + "query genes:" + '\033[0m'
enr = gp.enrichr(gene_list, gene_sets=databaseToQuery, cutoff=0.05, no_plot=False)
gsea_results= enr.res2d
gsea_results = gsea_results.sort_values('P-value', ascending=True)
display(gsea_results.head())


[1mquery genes:[0m


Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Old P-value,Old Adjusted P-value,Z-score,Combined Score,Genes
0,focal adhesion,7/356,0.000139,0.012901,9.6084e-07,8.9e-05,-3.045764,27.055581,CDC42;RAC2;ARHGEF2;RAC1;ARHGEF7;RND3;RHOA
2,chromatoid body,2/32,0.0047,0.141577,0.001539217,0.038897,-1.836405,9.843502,CDC42;RAC1
3,cytoplasmic ribonucleoprotein granule,2/32,0.0047,0.141577,0.001539217,0.038897,-1.835543,9.838881,CDC42;RAC1
4,P granule,2/37,0.006246,0.141577,0.002022967,0.038897,-1.923128,9.761333,CDC42;RAC1
5,actin filament branch point,2/45,0.009134,0.141577,0.002927729,0.038897,-1.920555,9.018449,RAC2;RAC1


**Note** 
Information about the values in each column can be found here http://amp.pharm.mssm.edu/Enrichr/help#background

    