# G2P Genotype to Phenotype example notebook.

This example illustrates how to access the GA4GH Genotype to Phenotype service.
The G2P service lets researchers query features, phenotypes, and their associated evidence.

The examples are based on the compliance server data set with a full complement of [G2P data](https://github.com/ohsu-computational-biology/ohsu-server-util/blob/master/cgd-08-09-2016.ttl) from the Monarch project.


## Initialize the client
In this step we create a client object, which is used to communicate with the server. It is initialized with the URL of the GA4GH endpoint.

In [34]:
import ga4gh_client.client as client
ga4gh_endpoint = "http://1kgenomes.ga4gh.org"
c = client.HttpClient(ga4gh_endpoint)

## Search for feature set and phenotype association set identifiers
These calls return the phenotype association sets and feature sets hosted by the API. Note that we query every dataset hosted at the endpoint. The identifiers of the feature set and the phenotype association set are used by all subsequent API calls.

In [35]:
datasets = c.search_datasets()
phenotype_association_set_id = None
phenotype_association_set_name = None
for dataset in datasets:
  phenotype_association_sets = c.search_phenotype_association_sets(dataset_id=dataset.id)
  for phenotype_association_set in phenotype_association_sets:
    phenotype_association_set_id = phenotype_association_set.id
    phenotype_association_set_name = phenotype_association_set.name
    print 'Found G2P phenotype_association_set:', phenotype_association_set.id, phenotype_association_set.name

assert phenotype_association_set_id
assert phenotype_association_set_name

feature_set_id = None
datasets = c.search_datasets()
for dataset in datasets:
  featuresets = c.search_feature_sets(dataset_id=dataset.id)
  for featureset in featuresets:
    if phenotype_association_set_name in featureset.name:
      feature_set_id = featureset.id
      print 'Found G2P feature_set:', feature_set_id
assert feature_set_id

Found G2P phenotype_association_set: WyIxa2dlbm9tZXMiLCJjZ2QiXQ cgd
Found G2P feature_set: WyIxa2dlbm9tZXMiLCJjZ2QiXQ
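
The feature set and the phenotype association set print the same identifier. The value looks like unpadded URL-safe base64, and decoding it with the standard library yields the JSON list `["1kgenomes", "cgd"]`, which suggests the identifier is built from the dataset name and the set name. The sketch below is only an illustration: `decode_compound_id` is a hypothetical helper, not part of `ga4gh_client`, and the encoding is an implementation detail, so client code should keep treating identifiers as opaque strings.

```python
import base64
import json


def decode_compound_id(encoded):
    """Decode an unpadded URL-safe base64 string that holds a JSON list.

    Hypothetical helper for illustration; not part of ga4gh_client.
    """
    # base64 input must be a multiple of 4 characters, so restore the padding.
    padded = encoded + "=" * (-len(encoded) % 4)
    raw = base64.urlsafe_b64decode(padded.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


# Identifier copied from the output printed above.
parts = decode_compound_id("WyIxa2dlbm9tZXMiLCJjZ2QiXQ")
print(parts)
```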


## Search for genomic features by location

Using the feature set id returned above, the following request returns the list of features that exactly match a location.


In [36]:
feature_generator = c.search_features(feature_set_id=feature_set_id,
                        reference_name="chr7",
                        start=55249005,
                        end=55249006
                    )

features = list(feature_generator)
assert len(features) == 1
print "Found {} features in G2P feature_set {}".format(len(features),feature_set_id)
feature = features[0]
print [feature.name,feature.gene_symbol,feature.reference_name,feature.start,feature.end]


Found 1 features in G2P feature_set WyIxa2dlbm9tZXMiLCJjZ2QiXQ
[u'EGFR S768I missense mutation', u'COSM6241', u'chr7', 55249005L, 55249006L]
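
To make "exactly match a location" concrete, the sketch below applies the same kind of test to local records. It is an illustration under stated assumptions: the `Feature` tuple and the `features_at` helper are hypothetical, the second record is invented for contrast, and the matching rules the server actually applies are not shown in this notebook.

```python
from collections import namedtuple

# Hypothetical local stand-in for the feature records returned by the server.
Feature = namedtuple("Feature", ["name", "reference_name", "start", "end"])

local_features = [
    # Values copied from the output printed above.
    Feature("EGFR S768I missense mutation", "chr7", 55249005, 55249006),
    # Invented record, included only so the filter has something to exclude.
    Feature("hypothetical neighbouring feature", "chr7", 55249010, 55249011),
]


def features_at(features, reference_name, start, end):
    """Keep features whose reference name, start and end all match exactly."""
    return [
        f for f in features
        if f.reference_name == reference_name
        and f.start == start
        and f.end == end
    ]


hits = features_at(local_features, "chr7", 55249005, 55249006)
print([f.name for f in hits])
```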


## Search features by name

If the location is not known, we can query by feature name instead. Using the feature set id returned above, the following request returns the features whose name exactly matches `'EGFR S768I missense mutation'`.


In [37]:

feature_generator = c.search_features(feature_set_id=feature_set_id, name='EGFR S768I missense mutation')
features = list(feature_generator)
assert len(features) == 1
print "Found {} features in G2P feature_set {}".format(len(features),feature_set_id)
feature = features[0]
print [feature.name,feature.gene_symbol,feature.reference_name,feature.start,feature.end]



Found 1 features in G2P feature_set WyIxa2dlbm9tZXMiLCJjZ2QiXQ
[u'EGFR S768I missense mutation', u'COSM6241', u'chr7', 55249005L, 55249006L]


## Get evidence associated with that feature.
Once we have looked up the feature, we can then search for all evidence associated with that feature.


In [38]:
feature_phenotype_associations = c.search_genotype_phenotype(
                                    phenotype_association_set_id=phenotype_association_set_id,
                                    feature_ids=[f.id for f in features])
associations = list(feature_phenotype_associations)
assert len(associations) >= len(features)
print "There are {} associations".format(len(associations))
print "\n".join([a.description for a in associations])


There are 2 associations
Association: genotype:[EGFR S768I missense mutation] phenotype:[Adenosquamous carcinoma with response to therapy] environment:[irreversible EGFR TKIs] evidence:[response] publications:[http://www.ncbi.nlm.nih.gov/pubmed/22753918|http://www.ncbi.nlm.nih.gov/pubmed/22753918]
Association: genotype:[EGFR S768I missense mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728|http://www.ncbi.nlm.nih.gov/pubmed/23102728]
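
Each description printed above follows a `key:[value]` layout, so it can be split into fields for further processing. The sketch below is one way to do that with a regular expression; `parse_association_description` is a hypothetical helper, and it assumes that values never contain a closing bracket and that publications are separated by `|`, which holds for the lines shown here but is not guaranteed by the API.

```python
import re

# Matches key:[value] pairs such as genotype:[...] or evidence:[...].
FIELD_PATTERN = re.compile(r"(\w+):\[(.*?)\]")


def parse_association_description(description):
    """Split an association description into a dict of its bracketed fields."""
    fields = dict(FIELD_PATTERN.findall(description))
    # Publications are '|'-separated and may repeat; keep first occurrences.
    unique_publications = []
    for url in fields.get("publications", "").split("|"):
        if url and url not in unique_publications:
            unique_publications.append(url)
    fields["publications"] = unique_publications
    return fields


# First description line copied from the output printed above.
example = (
    "Association: genotype:[EGFR S768I missense mutation] "
    "phenotype:[Adenosquamous carcinoma with response to therapy] "
    "environment:[irreversible EGFR TKIs] evidence:[response] "
    "publications:[http://www.ncbi.nlm.nih.gov/pubmed/22753918|"
    "http://www.ncbi.nlm.nih.gov/pubmed/22753918]"
)
parsed = parse_association_description(example)
print(parsed["genotype"], "|", parsed["environment"], "|", parsed["publications"])
```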


## Search a phenotype
Alternatively, a researcher can query for a phenotype, in this case by matching the phenotype's description against the pattern `'Adenosquamous carcinoma .*'`.


In [39]:
phenotypes_generator = c.search_phenotype(
                phenotype_association_set_id=phenotype_association_set_id,
                description="Adenosquamous carcinoma .*"
                )
phenotypes = list(phenotypes_generator)

assert len(phenotypes) > 0
print "\n".join(set([p.description for p in phenotypes])) 


Adenosquamous carcinoma with response to therapy
Adenosquamous carcinoma with decreased sensitivity to therapy
Adenosquamous carcinoma with sensitivity to therapy
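
The description argument looks like a regular expression. As a rough local illustration, the sketch below applies the same pattern with Python's `re` module to the three descriptions printed above plus one invented non-matching description. Whether the server uses Python-compatible regular expressions, and whether it anchors the pattern, are assumptions here.

```python
import re

descriptions = [
    # Copied from the output printed above.
    "Adenosquamous carcinoma with response to therapy",
    "Adenosquamous carcinoma with decreased sensitivity to therapy",
    "Adenosquamous carcinoma with sensitivity to therapy",
    # Invented description, included only to show a non-match.
    "Hypothetical unrelated phenotype",
]

pattern = re.compile(r"Adenosquamous carcinoma .*")

# re.match anchors at the start of the string, so the prefix must be present.
matching = [d for d in descriptions if pattern.match(d)]
print(len(matching))
```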


## Get evidence associated with those phenotypes.
The researcher can use those phenotype identifiers to query for evidence associations.


In [40]:
feature_phenotype_associations =  c.search_genotype_phenotype(
                                    phenotype_association_set_id=phenotype_association_set_id,
                                    phenotype_ids=[p.id for p in phenotypes])
associations = list(feature_phenotype_associations)
assert len(associations) >= len(phenotypes)
print "There are {} associations".format(len(associations))


There are 38 associations


## Further constrain associations with environment
The researcher can limit the associations returned by adding an environment constraint, passed here as an `EvidenceQuery`.

In [30]:
import ga4gh_client.protocol as protocol
evidence = protocol.EvidenceQuery()
evidence.description = "MEK inhibitors"
    
feature_phenotype_associations =  c.search_genotype_phenotype(
                                    phenotype_association_set_id=phenotype_association_set_id,
                                    phenotype_ids=[p.id for p in phenotypes],
                                    evidence = [evidence]
                                    )
associations = list(feature_phenotype_associations)
print "There are {} associations".format(len(associations))
print "\n".join([a.description for a in associations])

There are 13 associations
Association: genotype:[EGFR L861Q missense mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728|http://www.ncbi.nlm.nih.gov/pubmed/23102728]
Association: genotype:[EGFR G719D missense mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728]
Association: genotype:[EGFR T790M missense mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://www.ncbi.nlm.nih.gov/pubmed/23102728|http://www.ncbi.nlm.nih.gov/pubmed/23102728]
Association: genotype:[EGFR exon 19 p.729-761 deletion mutation] phenotype:[Adenosquamous carcinoma with sensitivity to therapy] environment:[MEK inhibitors] evidence:[sensitivity] publications:[http://
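
The count drops from 38 to 13 once the extra constraint is added, which is consistent with the constraints being combined with a logical AND. The sketch below mimics that behaviour on a handful of local records taken from the outputs printed in this notebook. It is only a stand-in: `Association` and `filter_associations` are hypothetical names, and the AND semantics are an assumption about the server, not something the notebook states.

```python
from collections import namedtuple

# Hypothetical local record; fields mirror the printed descriptions.
Association = namedtuple("Association", ["genotype", "phenotype", "environment"])

SENSITIVITY = "Adenosquamous carcinoma with sensitivity to therapy"
RESPONSE = "Adenosquamous carcinoma with response to therapy"

records = [
    Association("EGFR S768I missense mutation", RESPONSE, "irreversible EGFR TKIs"),
    Association("EGFR S768I missense mutation", SENSITIVITY, "MEK inhibitors"),
    Association("EGFR L861Q missense mutation", SENSITIVITY, "MEK inhibitors"),
    Association("EGFR G719D missense mutation", SENSITIVITY, "MEK inhibitors"),
    Association("EGFR T790M missense mutation", SENSITIVITY, "MEK inhibitors"),
]


def filter_associations(associations, phenotypes, environment=None):
    """Keep associations matching a phenotype AND, if given, the environment."""
    kept = [a for a in associations if a.phenotype in phenotypes]
    if environment is not None:
        kept = [a for a in kept if a.environment == environment]
    return kept


by_phenotype = filter_associations(records, {SENSITIVITY, RESPONSE})
constrained = filter_associations(records, {SENSITIVITY, RESPONSE}, "MEK inhibitors")
print(len(by_phenotype), len(constrained))
```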

## Plotting associations
The `bokeh` package should be installed for graphing.
First, we collect a set of features.

In [31]:
feature_generator = c.search_features(feature_set_id=feature_set_id, name='.*KIT.*')
features = list(feature_generator)
assert len(features) > 0
print "Found {} features in G2P feature_set {}".format(len(features),feature_set_id)


Found 69 features in G2P feature_set WyIxa2dlbm9tZXMiLCJjZ2QiXQ


## Get all associations 
Then, we select all the associations for those features.

In [32]:
feature_phenotype_associations = c.search_genotype_phenotype(
                                    phenotype_association_set_id=phenotype_association_set_id,
                                    feature_ids=[f.id for f in features])
associations = list(feature_phenotype_associations)
print "There are {} associations".format(len(associations))

There are 52 associations


## Association Heatmap
Developers can use the G2P package to create researcher-friendly applications.

In [33]:
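# Note: bokeh.charts was removed from later Bokeh releases (it moved to the separate bkcharts package).
from bokeh.charts import HeatMap, output_notebook, show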

from bokeh.layouts import column
from bokeh.models import ColumnDataSource
from bokeh.models.widgets import DataTable,   TableColumn
from bokeh.models import HoverTool


feature_ids = {}
for feature in features:
    feature_ids[feature.id]=feature.name

phenotype_descriptions = []
feature_names = []
association_count = [] 
association_descriptions = []

for association in associations:
    for feature_id in association.feature_ids:
        phenotype_descriptions.append(association.phenotype.description)
        feature_names.append(feature_ids[feature_id])
        association_count.append(1)
        association_descriptions.append(association.description)

output_notebook()
  
data = {'feature': feature_names  ,
        'association_count': association_count,
        'phenotype': phenotype_descriptions,
        'association_descriptions': association_descriptions
        }

hover = HoverTool(
        tooltips=[
            ("associations", "@values")
        ]
    )

hm = HeatMap(data, x='feature', y='phenotype', values='association_count',
             title='G2P Associations for KIT', stat='sum',
             legend=False,width=1024,
             tools=[hover], #"hover,pan,wheel_zoom,box_zoom,reset,tap",
             toolbar_location="above")

source = ColumnDataSource(data)
columns = [
        TableColumn(field="association_descriptions", title="Description"),
    ]
data_table = DataTable(source=source, columns=columns,width=1024 )

show( column(hm,data_table)  )
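
The cell above emits one row per (association, feature) pair with a count of 1 and asks the chart to aggregate with `stat='sum'`. The sketch below reproduces that aggregation with the standard library, so the per-cell totals can be inspected without any plotting dependency. The rows are invented placeholders with the same keys as `data` above, and the assumption is that the chart sums `association_count` for each (feature, phenotype) pair.

```python
from collections import defaultdict

# Placeholder rows shaped like the `data` dict built above (values invented).
data = {
    "feature": ["feature A", "feature A", "feature B"],
    "phenotype": ["phenotype X", "phenotype X", "phenotype Y"],
    "association_count": [1, 1, 1],
}

# Sum the counts for every (feature, phenotype) cell of the heatmap.
cell_totals = defaultdict(int)
rows = zip(data["feature"], data["phenotype"], data["association_count"])
for feature, phenotype, count in rows:
    cell_totals[(feature, phenotype)] += count

for (feature, phenotype), total in sorted(cell_totals.items()):
    print(feature, phenotype, total)
```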
