# Genes to Phenotypes and Phenotypes to Genes

This notebook allows you to query the gene and phenotype annotations in the IDR. 

**1. Gene to Phenotype queries**
You can query with a gene list or GO ontology ID for a protein complex/cellular component which is then queried for the associated gene members and get back phenotypes associated with those genes. 


**2. Phenotype to Gene queries**
You can query with a phenotype name and get back a list of all the genes linked 
to that phenotype.  You can then perform a GO enrichment analysis to see if the genes
are signficantly associated with any GO terms.




### First set up your environment and import all the packages you will need

In [1]:
# import pandas for data manipulation
import pandas

# import some functions for displaying the results
from ipywidgets import widgets, interact, fixed
from functools import wraps
from IPython.display import display, HTML

# import functions from the idr package to perform the queries
from idr import connection, create_http_session
from idr import genes_of_interest_go
from idr.widgets import textbox_widget, progress
from idr.widgets import dropdown_widget
from idr import get_phenotypes_for_genelist, get_similar_genes
from idr import attributes_by_attributes, get_organism_screenids

# import a package for doing GO term enrichement analysis
import gseapy as gp

### Create a session for querying the IDR

In [2]:
session = create_http_session()

### Set some variables to say that are going to only look at human genes

In [3]:
organism = 'Homo sapiens'
tax_id = '9606'


<br>
## Genes to Phenotypes query

### Create boxes to enter the Gene Ontology ID or manual gene list

In [5]:
go_term = textbox_widget('', 'Enter GO Id e.g. GO:0005885', 'Gene Ontology Id:', True)
manual_gene_list = textbox_widget('','Comma separated gene symbols', 'Manual Gene List:', True)

# ENTER VALUES IN ONE OR BOTH BOXES THEN MOVE TO THE NEXT CELL WITHOUT RUNNING THE CELL AGAIN

A Jupyter Widget

A Jupyter Widget

### Query GO for the genes associated with the GO ID, combine with any manually entered gene list

In [6]:
go_gene_list = []
if go_term.value.split(",") != ['']:
    go_gene_list = genes_of_interest_go(go_term.value, tax_id)
else:
    print 'Please enter a valid Gene Ontology Id'
manual_list = manual_gene_list.value.split(",")
if manual_list != ['']:
    go_gene_list = list(set(go_gene_list + manual_list))
    
print "Query list of genes:", go_gene_list

Please enter a valid Gene Ontology Id
Query list of genes: [u' HELZ2', u'ASH2L']


### Query the IDR to get all the phenotypes associated with the list of genes

In [7]:
# display the results in a table
# set the columns so that all values will be shown
pandas.set_option('display.max_colwidth', -1)

# then display the table
[query_genes_dataframe, screen_to_phenotype_dictionary] = get_phenotypes_for_genelist(session, go_gene_list, organism)
display(HTML(query_genes_dataframe.to_html( escape=False)))



Unnamed: 0,Entrez,Ensembl,Key,Value,PhenotypeName,PhenotypeAccession,ScreenIds
HELZ2,[85441],[],EntrezID,85441,"[elongated cell phenotype, cell with projections]","[CMPO_0000077, CMPO_0000071]","[1202, 1202]"
ASH2L,[9070],[ENSG00000129691],GeneName,ASH2L,[elongated cell phenotype],[CMPO_0000077],[1202]


---
## Phenotypes to genes query

### Choose a phenotype that we'd like to query
Enter a Phenotype Term Name E.g. elongated cell phenotype

In [47]:
Phenotype = textbox_widget('elongated cell phenotype', '', 'Phenotype', True)
# ENTER A VALUE IN THE BOX THEN MOVE TO THE NEXT CELL WITHOUT RUNNING THE CELL AGAIN
# THE BOX HAS BE PREPOPULATED WITH A PHENOTYPE VALUE BUT YOU CAN CHANGE THIS

A Jupyter Widget

### We can write a function to query the IDR for the genes associated with a phenotype

In [41]:
def get_genes_for_phenotype(phenotype, conn, sid): 
    args = {
            "name": "Phenotype Term Name",
            "value": phenotype,
            "ns": "openmicroscopy.org/mapr/phenotype",
            "ns2": "openmicroscopy.org/mapr/gene",
            "s_id": sid
        }

    cc = attributes_by_attributes(conn, **args)
    dataframe = pandas.DataFrame.from_dict(cc)
       
    return dataframe


### Then we set up the connection to the IDR

In [42]:
conn = connection()
idr_base_url='https://idr.openmicroscopy.org'

Connected to IDR...


### Then we get a list of all the screens with human data

In [43]:
sid_list = get_organism_screenids(session, organism, idr_base_url)

print ("The IDR identifiers for the screens with human genes are:", sid_list)

('The IDR identifiers for the screens with human genes are:', ['102', '253', '206', '251', '803', '1351', '1202', '1101', '1302', '1251', '1151', '1203', '1204', '1851', '1651', '1652', '1653', '1654', '1801', '1751', '1901', '1952'])


### Go through each screen and find the genes associated with the selected phenotype

In [44]:
phenotype = Phenotype.value
gene_set = set()

for i, sid in enumerate(set(sid_list)):
    df = get_genes_for_phenotype(phenotype, conn, sid)
    progress(i+1, len(set(sid_list)), status='Iterating through screens')
    if df.empty:
        continue

    for x in df.iloc[:, 0]:

        key = x[0]
        value = x[1]

        if key == "Gene Symbol":
            if value != '':
                genesym = value
                gene_set.add(genesym)

gene_list = list(gene_set) 
print gene_list
print "\nThere are", len(gene_list), "unique genes"

['FXYD4', 'ZCCHC17', 'PROCA1', 'PES1', 'ZNF79', 'CCDC140', 'LRFN5', 'USP28', 'C12orf43', 'DGCR8', 'RPF2', 'ZCCHC9', 'RPS3A', 'C3orf17', 'YAF2', 'NOL12', 'CLCN7', 'CENPA', 'RRS1', 'RPL15', 'RPL11', 'RPL13', 'SNAI2', 'GPATCH4', 'GPATCH2', 'RPL19', 'C19orf53', 'C1orf35', 'CCDC86', 'SFN', 'PTMS', 'SURF6', 'C7orf50', 'RRP1', 'MRPL20', 'RPL6', 'NOP2', 'RPL3', 'C1orf25', 'NHP2', 'NOP16', 'ADRBK2', 'AEN', 'GTF2E2', 'DDX52', 'DDX50', 'FTSJ3', 'NARS', 'RBM34', 'C10orf58', 'C1orf131', 'MRPL47', 'C11orf57', 'RPL34', 'CISD2', 'CCDC59', 'MAGEB2', 'PTPN9', 'RSL1D1', 'MKI67IP', 'DIMT1L', 'TNP2', 'NEIL1', 'RPL7L1', 'ELOVL1', 'ZNF501', 'DDX54', 'PSMB10', 'RPS6', 'RPS5', 'RPS8', 'C18orf20', 'NUSAP1', 'RPS15', 'CCDC106', 'NPM1', 'NOC2L', 'UTP11L', 'RPL13A', 'C4orf31', 'RBM19', 'RPS25', 'ZNF323', 'TAL2', 'LOC441242', 'NOL7', 'ZNF800', 'PPAN', 'ACTN4', 'ABT1', 'PRM1', 'KRR1', 'MRTO4', 'MEAF6', 'C3orf49', 'RPS13', 'UTP3', 'RPL23A', 'RPS14', 'TCEB3', 'MRPS15', 'ZMAT4', 'RRP8', 'EIF2S2', 'TCEB3B', 'RPL28', 'NO

### Now we can perform enrichment analysis on the the gene set associated with the phenotype 
It will find which terms are over-represented using annotations for that gene set.
We are going to use the python package called GSEApy to do this.  It performs gene set enrichment analysis by calling the enrichr online api. See  http://pythonhosted.org/gseapy/run.html for more information about the package, and http://amp.pharm.mssm.edu/Enrichr/ for more information about Enrichr.

### Set which database to query 
You can pick on the list shown on this web page http://amp.pharm.mssm.edu/Enrichr/#stats

Good ones to try are: <br>
* GO_Cellular_Component_2017b
* GO_Biological_Process_2017b
* GO_Molecular_Function_2017b
* Reactome_2016

In [45]:
# EDIT THE NEXT LINE IF YOU WANT TO CHANGE THE DATABASE
databaseToQuery = 'GO_Cellular_Component_2017b'

### Create a function to do the enrichment analysis and then print out the top 5 enriched categories

In [46]:
def getenrichedterms(df, cutoff=0.05):
    enrichedlist = []
    for GOterm in df.iterrows():
        pvalue = df.loc[int(GOterm[0]),'Adjusted P-value']
        if pvalue <= cutoff:
            term = df.loc[GOterm[0],'Term']
            start1 = term.index('(')+1
            end1 = term.index(')')
            enrichedlist.append(term[start1:end1])
    return enrichedlist

    
print '\033[1m' + "query genes:" + '\033[0m'
enr = gp.enrichr(gene_list, gene_sets=databaseToQuery, cutoff=0.05, no_plot=False)
gsea_results= enr.res2d
gsea_results = gsea_results.sort_values('P-value', ascending=True)
display(gsea_results.head())


[1mquery genes:[0m


Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Old P-value,Old Adjusted P-value,Z-score,Combined Score,Genes
0,nucleolar part,45/635,5.74446e-35,7.387746000000001e-33,8.297452e-29,1.056118e-26,-3.815975,300.86004,MRPS15;RPL3;MEAF6;RPS19BP1;RPL11;RRP1;NOP2;NOL7;NOLC1;CDCA8;DDX21;NOC2L;RPF2;RRP8;RPS14;C19ORF53;PES1;RRS1;SURF6;RBM34;UTP14A;LYAR;RPS13;NOP56;ABT1;NPM1;NOP16;UTP3;KRR1;RPS6;RBM19;RPL13A;DDX54;AEN;RPL23A;RPS3A;NPM3;DDX50;FTSJ3;RPS25;RSL1D1;ILF3;DKC1;MRTO4;CCDC86
1,nucleolus,45/637,6.596201e-35,7.387746000000001e-33,9.429629e-29,1.056118e-26,-3.812106,300.027895,MRPS15;RPL3;MEAF6;RPS19BP1;RPL11;RRP1;NOP2;NOL7;NOLC1;CDCA8;DDX21;NOC2L;RPF2;RRP8;RPS14;C19ORF53;PES1;RRS1;SURF6;RBM34;UTP14A;LYAR;RPS13;NOP56;ABT1;NPM1;NOP16;UTP3;KRR1;RPS6;RBM19;RPL13A;DDX54;AEN;RPL23A;RPS3A;NPM3;DDX50;FTSJ3;RPS25;RSL1D1;ILF3;DKC1;MRTO4;CCDC86
2,cytosolic large ribosomal subunit,14/68,8.585105e-18,6.410212e-16,2.364643e-14,1.7656e-12,-2.008388,78.922631,RPL3;RPL32;RPL31;RPL34;RPL11;RPL13A;RPL23A;RPL6;RPL7L1;RPL13;SURF6;RPL15;RPL28;RPL19
3,cytosolic small ribosomal subunit,9/56,7.966611e-11,4.461302e-09,9.388976e-09,5.257826e-07,-2.081733,48.406902,RPS15;RPS25;RPS14;RPS8;RPS5;RPS6;RPL13;RPS3A;RPS13
5,Noc1p-Noc2p complex,6/35,8.514245e-08,3.814382e-06,2.289266e-06,0.0001025591,-1.771295,28.834809,RSL1D1;PES1;RRP1;PPAN;RRS1;NOC2L


**Note** 
Information about the values in each column can be found here http://amp.pharm.mssm.edu/Enrichr/help#background

    

<br>
#### When you are completely finished running the notebook close the connection to IDR

In [17]:
conn.close()

This notebook has been modifed from previous notebooks created by Balaji Ramalingam.