# Simple Example of Acquiring Data from Multiple Biological Sources

This notebook is a starting point for researchers and domain experts to explore various data sources and to experiment with how they can be combined into pipelines that support a Blackboard architecture for answering scientific questions. Our initial, modest goals are to focus on the Translator *competency questions* and to begin incorporating and integrating the data sources we expect to be useful.

## Typical Structure

This notebook, and those cloned from it, will follow this general structure:

- Background
    - Relevant Competency Question(s) or Research Problem
    - Current status and remaining work (to show the reader how complete the notebook is)
- Data Sources
    - Descriptions and references, including API documentation links and a brief summary of each source's scope and content
- Transformation and Integration
    - Simple Data Access examples to illustrate the API usage and the type/shape of the data
    - More sophisticated examples to examine sources and experiment/demonstrate integration possibilities
    - Visualization and Summarization
- Develop Prototype Pipelines (optional)
    - Where possible, prototype a reusable set of code illustrating a desired solution or capability, with an eye towards extracting and modularizing it for presentation via BioLink or integration into other workflows.

## Background

### Current Status

- Use Jupyter notebooks (or Zeppelin or Galaxy) as a flexible and open workbench
- Explain and demonstrate API access to different data sources
- (BONUS POINTS) Demonstrate integration, comparison and visualization of diverse data sources

### Simplifications for this first version

- We look up 'acetylsalicylic acid' rather than 'aspirin' because that term is currently present in all of the sources, and I am not sure the Monarch BioLink API I am using recognizes 'aspirin' yet.


#### ToDo

- Use a Binder badge to simplify editing from GitHub
- New goal: Drug-to-conditions and Condition-to-Drugs relations
- Consider WikiData as a source
- Accommodate anticipated BROAD probability models
- https://pharos.nih.gov/idg/index
- Competency Questions


## Data Sources

### ChEBI Data

Monarch ingests [Chemical Entities of Biological Interest (ChEBI)](https://www.ebi.ac.uk/chebi/) data and makes it available via SciGraph, the Monarch API, and the new BioLink API.

For reference, here is ChEBI's entry for 'acetylsalicylic acid' (also known as aspirin):

[CHEBI:15365 acetylsalicylic acid](https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15365)


### BioLink substance data from ChEBI via Monarch

Monarch has ingested ChEBI data, and the BioLink API exposes a `/bioentity/substance/{id}/participant_in/` endpoint that returns some data, for example:

https://api.monarchinitiative.org/api/bioentity/substance/CHEBI:40036/participant_in/

However, the basic `/bioentity/substance/{id}` endpoint returns no useful data, so we will use the `participant_in` link above until BioLink has a fleshed-out `/substance` endpoint.
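
The request URL for this endpoint can be assembled from its parts rather than encoded by hand. The following is a minimal sketch, not part of any Monarch tooling: `build_participant_in_url` is a hypothetical helper, and the `rows` and `fetch_objects` parameters are taken from the request URL used later in this notebook. It only builds the string and does not call the API.

```python
from urllib.parse import quote, urlencode

# Base path copied from the example link above.
BIOENTITY_BASE = "https://api.monarchinitiative.org/api/bioentity"


def build_participant_in_url(substance_id, rows=20, fetch_objects=True):
    """Build the participant_in URL for a substance CURIE (hypothetical helper)."""
    # safe="" makes quote() escape ':' as well, so 'CHEBI:15365' -> 'CHEBI%3A15365'.
    encoded_id = quote(substance_id, safe="")
    # The query string expects lowercase 'true'/'false' rather than Python's 'True'.
    params = urlencode({"rows": rows, "fetch_objects": str(fetch_objects).lower()})
    return f"{BIOENTITY_BASE}/substance/{encoded_id}/participant_in/?{params}"


url = build_participant_in_url("CHEBI:15365")
print(url)
```

Keeping the CURIE unencoded in the calling code makes it easier to swap in a different substance identifier later.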



### Substance data from the GINAS API

[GINAS Aspirin](https://tripod.nih.gov/ginas/app/api/v1/substances/search?q=root_names_name%3A%22%5EASPIRIN%24%22)

[GINAS acetylsalicylic acid](https://tripod.nih.gov/ginas/app/api/v1/substances/search?q=root_names_name%3A%22%5Eacetylsalicylic+acid%24%22)




## Transformation and Integration

I start by confirming that I can obtain useful data from each of the above sources, focusing on a single substance: **aspirin**, or **acetylsalicylic acid** (CHEBI:15365).


#### Reading BioLink's `/substance` endpoint

In [1]:
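import pandas as pd
from urllib.parse import urlencode  # used further down to build the GINAS query string

# Widen the display so the nested dict columns are not truncated
pd.set_option('display.max_colwidth', 3800)
pd.set_option('display.expand_frame_repr', False)

# participant_in associations for acetylsalicylic acid (CHEBI:15365, with ':' encoded as %3A)
biolinkURL = "https://api.monarchinitiative.org/api/bioentity/substance/CHEBI%3A15365/participant_in/?rows=20&fetch_objects=true"
df = pd.read_json(biolinkURL, typ="frame", orient="records")
# df.head(5)
df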


Unnamed: 0,evidence_graph,evidence_types,id,object,object_extension,provided_by,publications,qualifiers,relation,slim,subject,subject_extension,type
0,"{'edges': None, 'nodes': None}",,,"{'label': 'benzoic acids', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:22723'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,
1,"{'edges': None, 'nodes': None}",,,"{'label': 'acetate ester', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:47622'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,
2,"{'edges': None, 'nodes': None}",,,"{'label': 'cyclooxygenase 1 inhibitor', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:50630'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,
3,"{'edges': None, 'nodes': None}",,,"{'label': 'EC 1.1.1.188 (prostaglandin-F synthase) inhibitor', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:77425'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,
4,"{'edges': None, 'nodes': None}",,,"{'label': None, 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'OBO:upheno/monarch.owl'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,
5,"{'edges': None, 'nodes': None}",,,"{'label': 'non-narcotic analgesic', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:35481'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,
6,"{'edges': None, 'nodes': None}",,,"{'label': 'non-steroidal anti-inflammatory drug', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:35475'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,
7,"{'edges': None, 'nodes': None}",,,"{'label': 'platelet aggregation inhibitor', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:50427'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,
8,"{'edges': None, 'nodes': None}",,,"{'label': 'prostaglandin antagonist', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:49023'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,
9,"{'edges': None, 'nodes': None}",,,"{'label': 'salicylates', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:26596'}",,,,,"{'label': None, 'categories': None, 'synonyms': None, 'description': None, 'types': None, 'id': None}",,"{'label': 'acetylsalicylic acid', 'taxon': {'label': None, 'id': None}, 'description': None, 'xrefs': None, 'synonyms': None, 'categories': None, 'types': None, 'id': 'CHEBI:15365'}",,


Now that we see the data frame from our source, we can use ordinary Python (and the pandas library) to access different parts. For example, let's grab the first row's `object` value.

In [2]:
df.object[0]

{'categories': None,
 'description': None,
 'id': 'CHEBI:22723',
 'label': 'benzoic acids',
 'synonyms': None,
 'taxon': {'id': None, 'label': None},
 'types': None,
 'xrefs': None}

In [3]:
df.object[0]['label']

'benzoic acids'
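
Indexing one row at a time gets tedious, so it can help to pull the nested fields out into flat columns. The following is a minimal sketch under stated assumptions: `sample` is a hand-built stand-in holding three of the rows shown above (so the cell runs without a network call), and the column names `subject_id`, `object_id`, and `object_label` are my own choices rather than anything defined by BioLink.

```python
import pandas as pd

# Stand-in for the BioLink result: three rows copied from the output above,
# keeping only the nested fields used here.
subject = {"id": "CHEBI:15365", "label": "acetylsalicylic acid"}
sample = pd.DataFrame(
    {
        "subject": [subject, subject, subject],
        "object": [
            {"id": "CHEBI:22723", "label": "benzoic acids"},
            {"id": "CHEBI:47622", "label": "acetate ester"},
            {"id": "CHEBI:50630", "label": "cyclooxygenase 1 inhibitor"},
        ],
    }
)

# Each cell holds a dict, so map() a plain key lookup over every row.
flat = pd.DataFrame(
    {
        "subject_id": sample["subject"].map(lambda d: d["id"]),
        "object_id": sample["object"].map(lambda d: d["id"]),
        "object_label": sample["object"].map(lambda d: d["label"]),
    }
)
print(flat)
```

The same lookups should work on the real `df` by replacing `sample` with `df`; rows whose nested `label` is `None` (such as row 4 above) simply carry `None` through.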

#### Querying the GINAS API

In [4]:
ginasBase = "https://tripod.nih.gov/ginas/app/api/v1/substances/search?"
# Search the root_names_name field for 'acetylsalicylic acid'
ginasParams = {'q': 'root_names_name:"^acetylsalicylic acid$"'}

# urlencode percent-escapes the colon, quotes, ^ and $, and turns the space into '+'
ginasPath = urlencode(ginasParams)
ginasURL = ginasBase + ginasPath

print(ginasURL)

https://tripod.nih.gov/ginas/app/api/v1/substances/search?q=root_names_name%3A%22%5Eacetylsalicylic+acid%24%22


In [5]:
import json

import requests

r = requests.get(ginasURL)
r.raise_for_status()  # stop on an HTTP error instead of trying to parse an error page
c = r.json()
print(json.dumps(c, indent=2))



{
  "path": "/ginas/app/api/v1/substances/search",
  "query": "q=root_names_name:\"^acetylsalicylic acid$\"",
  "sideway": [],
  "count": 1,
  "id": 27263343,
  "etag": "a5a769bad894db88",
  "total": 1,
  "created": 1489119956013,
  "facets": [
    {
      "name": "Code System",
      "values": [
        {
          "label": "EMA ASSESSMENT REPORTS",
          "count": 1
        },
        {
          "label": "EPA PESTICIDE CODE",
          "count": 1
        },
        {
          "label": "IUPHAR",
          "count": 1
        },
        {
          "label": "LIVERTOX",
          "count": 1
        },
        {
          "label": "NDF-RT",
          "count": 1
        },
        {
          "label": "RXCUI",
          "count": 1
        },
        {
          "label": "WHO INTERNATIONAL PHARMACPOEIA",
          "count": 1
        },
        {
          "label": "WHO-ATC",
          "count": 1
        },
        {
          "label": "WHO-ESSENTIAL MEDICINES LIST",
          "count": 

##### Work in Progress

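The GINAS JSON result cannot be loaded directly into a pandas data frame. The top-level object mixes scalar values (such as `count`) with lists of different lengths (such as `sideway` and `facets`), so pandas cannot line them up into rows and columns. We will have to manipulate the JSON another way, or flatten it before converting it to a data frame.

**The following code fails with a `ValueError`:**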

In [6]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 3000)


df = pd.read_json(ginasURL, typ='frame', orient="index")
# df.head(5)
df

ValueError: arrays must all be same length
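
One possible way around this error is to walk the nested structure ourselves and build one flat row per facet value before handing the result to pandas. The following is a sketch only: `ginas_sample` is an abbreviated, hand-copied fragment of the response printed above (three of the "Code System" values), and `facet_rows` is a hypothetical helper, not part of the GINAS API or of pandas.

```python
import pandas as pd

# Abbreviated fragment of the GINAS response shown above.
ginas_sample = {
    "count": 1,
    "total": 1,
    "facets": [
        {
            "name": "Code System",
            "values": [
                {"label": "EMA ASSESSMENT REPORTS", "count": 1},
                {"label": "EPA PESTICIDE CODE", "count": 1},
                {"label": "IUPHAR", "count": 1},
            ],
        }
    ],
}


def facet_rows(response):
    """Return one flat dict per facet value (hypothetical helper)."""
    rows = []
    for facet in response.get("facets", []):
        for value in facet.get("values", []):
            rows.append(
                {"facet": facet["name"], "label": value["label"], "count": value["count"]}
            )
    return rows


# A list of flat dicts with identical keys converts cleanly to a data frame.
facets_df = pd.DataFrame(facet_rows(ginas_sample))
print(facets_df)
```

With the live response, `pd.DataFrame(facet_rows(c))` should give the same three columns, assuming every facet follows the `name`/`values` shape seen in the output above.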