# DataCure Text Mine Workflow
**Author**: Marc Jacobs, [Fraunhofer SCAI](https://www.scai.fraunhofer.de/en.html) (Mar 2019)

This notebook outlines the implementation of a `search for compound information from the literature`. Workflow to be demonstrated:

1. authenticate ([keycloak](http://keycloak.scai.fraunhofer.de/auth/realms/SCAI-Bio/))
1. find proper concept to text mine ( [OLS](https://www.ebi.ac.uk/ols/docs/api), [TeMOwl](http://bart.scai.fraunhofer.de:9090/swagger-ui.html#/) )
1. find proper documents containing that concept ( [SCAIView](http://api.scaiview.com) )
1. further analyze documents with NLP ( SCAIView -> UIMA )

## Notebook Imports

In [1]:
import simplejson as json
import requests
import urllib
from requests.exceptions import HTTPError
from IPython.core.display import display, HTML
import pandas as pd
import getpass

Some Fraunhofer specific configurations of the APIs to be used in this session.

* an authentication service
* an identifier and mapping service
* a retrieval and text mining service

All of them are currently hosted at SCAI. The containers will be either deployed deployed in the OpenRiskNet infrastructure or replaced by services within ORN.

In [2]:
keycloak_uri = 'http://keycloak.scai.fraunhofer.de/auth/realms/SCAI-Bio/'
temowl_uri   = 'http://bart.scai.fraunhofer.de:9090/'
scaiview_uri = 'http://api.scaiview.com/api/'

## Step 1 - fetch security token

All API calls are secured via OAuth 2. This means you need a security token to be passed along. <br>
As a start we use the _keycloak server_ from Fraunhofer SCAI (this will be replaced in the near future with the SSO of OpenRiskNet). At the moment we are sending a _client id_ and a _client secret_ in order to request a valid token.



In [18]:
clientSecret = getpass.getpass('Password:')

response = requests.post(
    keycloak_uri+'protocol/openid-connect/token', 
    data={'grant_type':'client_credentials', 'client_id':'temowl-backend', 'client_secret':clientSecret}
)

token = response.json()['access_token']
token[0:10]

Password:········


'eyJhbGciOi'

## Step 2 - get concept for Acetaminophen

search for concept of `Acetaminophen` aka `Paracetamol` in `chebi`

In [4]:
response = requests.get(
    temowl_uri+'terminologies/chebi/concepts/search?',
    params={'exact': 'false', 'q': 'Paracetamol', 'lang': 'en', 'page': '0', 'size': '10'},
    headers={'Accept': 'application/json'},
)

json_response = response.json()
acetaminophen = json_response['content'][0]['conceptID']
acetaminophen

'chebi:46195'

got it: `Acetaminophen` is `chebi:46195`.

now retrieve details on `Paracetamol`:

In [5]:
response = requests.get(
    temowl_uri+'terminologies/chebi/concepts/'+urllib.parse.quote(acetaminophen),
    params={'lang': 'en', 'page': '0', 'size': '10'},
    headers={'Accept': 'application/json'},
)
json_response = response.json()
json_response['description']['description']

'A member of the class of phenols that is 4-aminophenol in which one of the hydrogens attached to the amino group has been replaced by an acetyl group.'

fetch some synonyms in order to expand the search query

In [6]:
response = requests.get(
    temowl_uri+'terminologies/chebi/concepts/'+urllib.parse.quote(acetaminophen)+'/labels',
    params={'lang': 'en', 'page': '0', 'size': '10'},
    headers={'Accept': 'application/json'},
)

full_text_query = '( Acetaminophen'

json_response = response.json()
for synonym in json_response['content']:
    full_text_query += ' OR ' + synonym['name']
    
full_text_query += ' )'
full_text_query

'( Acetaminophen OR Paracetamol OR N-(4-hydroxyphenyl)acetamide OR paracetamol )'

## Step 3 - get documents on Acetaminophen

> **to be discussed** - here we are using the different Fraunhofer pre-indexed document collections
> * which document collections does ToxPlanet have?
> * how are they indexed and searchable?
> * how do we access them?
> * who is the technical contact there?

### define some helper functions

fetch a pmid and render HTML

In [7]:
def fetchDocumentAndRenderHTML( pmid, collection ):
   "fetch document with pmid and return HTML"
   response = requests.get(
        scaiview_uri+'v3/fetch/' + urllib.parse.quote(pmid)+'?',
        params={'collection': collection},
        headers={'Accept': 'text/html', 'Authorization': 'Bearer '+token},
   )
   return HTML(response.text)

def fetchDocument( pmid, collection ):
   "fetch document with pmid and return HTML"
   response = requests.get(
        scaiview_uri+'v3/fetch/' + urllib.parse.quote(pmid)+'?',
        params={'collection': collection},
        headers={'Accept': 'application/json', 'Authorization': 'Bearer '+token},
   )
   return response.json()

search for documents

In [8]:
def querySCAIView( query, collection, limit = 10 ):
    "fetch a list of documents matching the query, return json"

    encoded_query = urllib.parse.quote(query)
        
    response = requests.get(
        scaiview_uri+'v2/solr/search?',
        params={'q':encoded_query, 'rows':limit, 'sortField':'date', 'sortOrder':'DESC', 'collection':collection},
        headers={'Accept': 'application/json', 'Authorization': 'Bearer '+token},
    )
    
    return response.json()
    

def querySCAIViewAsDocuments( query, collection, limit = 10 ):
    "fetch a list of documents matching the query"
    
    json_response = querySCAIView( query, collection, limit )

    print('{}: {} found'.format(query, json_response['numFound']))
    return json_response['documents']


def querySCAIViewAsPD( query, collection, limit = 10 ):
    "fetch a list of documents matching the query, return as pd"

    json_response =  querySCAIView( query, collection, limit )
    
    documents = json_response['documents']
    numFound = json_response['numFound']
    
    print('{}: {} found'.format(query, numFound))

    df = pd.DataFrame([], columns = ['PMID','Title'])

    for x in range(0, min(len(documents),numFound-1)):
        df = df.append(pd.DataFrame([[documents[x]['id'][5:], documents[x]['title'][0]]], index = [x+1], columns = ['PMID','Title']))

    return df

### query for documents for Acetaminophen


*recall based* - with free text search

In [30]:
df = querySCAIViewAsPD(full_text_query + ' AND carcinogen* to human', 'Medline_2019', 10)
df.sort_values(by=['PMID'], ascending=False)

( Acetaminophen OR Paracetamol OR N-(4-hydroxyphenyl)acetamide OR paracetamol ) AND carcinogen* to human: 2291421 found


Unnamed: 0,PMID,Title
3,30640403,[Recurrence and chronicity of major depressive...
4,30640402,[Electroconvulsion therapy for persistent depr...
2,30572411,The Effect of Lower Transaction Costs on Socia...
1,30572410,Immigration Enforcement and Children's Living ...
5,30103352,Medication reconciliation and review for older...
6,29169162,Application of Electrocautery Needle Knife Com...
7,29166637,Assessing Vessel Tone during Coronary Artery S...
8,29045949,Effectiveness of a Home-Based Active Video Gam...
9,28881352,Current Practice of Airway Stenting in the Adu...
10,28864973,The Role of Oxytocin in Social Buffering: What...


*precision based* - using indexed substances field

In [10]:
df = querySCAIViewAsPD('Substances:Carcinogens AND Substances:Acetaminophen', 'Medline_2019', 10)
df.sort_values(by=['PMID'], ascending=False)

Substances:Carcinogens AND Substances:Acetaminophen: 146327 found


Unnamed: 0,PMID,Title
5,30589062,"Occupational risk perception, stressors and st..."
3,30568084,[Study on the Establishment of a Specific Simi...
4,30272428,What Interventions Work Best for Families Who ...
7,30199073,[Current trends of the choice and processing o...
6,30199059,[The in vitro examination of the effectiveness...
9,30141316,Determination of volatile organic compounds in...
10,30069765,Larotrectinib (LOXO-101).
8,29895139,Methods for the initial (non-laboratory) asses...
2,28466189,"A Review of the Environmental Degradation, Eco..."
1,28444578,Experimental Psychosis Research and Schizophre...


### Step 4 - mine full text documents

In [42]:
documents = querySCAIViewAsDocuments('document:paracetamol', 'PMC_2019', 10)
documents

document:paracetamol: 102636 found


[{'id': 'PMCID:PMC6176107',
  'title': ['The Heat Shock Protein, gp96, activates inflammasome signaling platforms in antigen presenting cells'],
  'abstract': ['Several heat shock proteins (HSP) prime immune responses which are, in part, a result of activation of antigen presenting cells (APCs). APCs respond to these immunogenic HSPs by up-regulating co-stimulatory molecules and secreting cytokines including IL-1β. These HSP-mediated responses are central mediators in pathological conditions ranging from cancer, sterile inflammation associated with trauma and rheumatoid arthritis. We tested here the requirement of inflammasomes in the release of IL-1β by one immunogenic HSP, gp96. Our results show that murine APCs activate NLRP3 inflammasomes in response to gp96, by K+efflux. This is shown to initiate inflammatory conditionsin vivo, in the absence of additional known inflammasome activators or infection. These results document a novel mechanism by which proteins of endogenous origin, t

In [19]:
fetchDocumentAndRenderHTML('PMCID:PMC2018698', 'PMC_2019')


In [22]:
fetchDocument('PMCID:PMC2018698', 'PMC_2019')['documentElement']['bodyMatter']['sections']


[{'paragraphs': [{'sentences': None,
    'structureElements': [{'captionedBox': None,
      'code': None,
      'dataTable': None,
      'figure': None,
      'formula': None,
      'imageContent': None,
      'outline': None,
      'quotation': None,
      'table': None,
      'textElement': {'text': 'Bladder cancer is the 4',
       'uuid': 'a11a8052-8982-4bf0-84a5-907714a1c371',
       'annotations': None},
      'sentence': None,
      'list': None},
     {'captionedBox': None,
      'code': None,
      'dataTable': None,
      'figure': None,
      'formula': None,
      'imageContent': None,
      'outline': None,
      'quotation': None,
      'table': None,
      'textElement': {'text': 'th',
       'uuid': '92c03ab7-d1df-4348-a78d-84de867d7cd4',
       'annotations': None},
      'sentence': None,
      'list': None},
     {'captionedBox': None,
      'code': None,
      'dataTable': None,
      'figure': None,
      'formula': None,
      'imageContent': None,
      'outline'

In [14]:
data = pd.io.json.json_normalize(documents)
data

Unnamed: 0,_version_,abstract,authors,date,docType,documentIdentifiers,id,journal,language,source,title
0,1624645317434540032,[Diabetes is a leading cause of progressive mo...,"[Victor T. ADEKANMBI, Olalekan A. UTHMAN, Sebh...",1583078512499,[PMC_FULLTEXT],"[PMID:30058263, DOI:10.1111/1753-0407.12829, M...",PMCID:PMC6318039,J Diabetes,[EN],J Diabetes,[Epidemiology of prediabetes and diabetes in N...
1,1624645043740475392,[CRISPR-Cas9 and Cas12a (Cpf1) nucleases are t...,"[Keunsub Lee, Yingxiao Zhang, Benjamin P. Klei...",1583078251468,[PMC_FULLTEXT],"[PMID:29972722, DOI:10.1111/pbi.12982, MANUSCR...",PMCID:PMC6320322,Plant Biotechnol. J.,[EN],Plant Biotechnol. J.,[Activities and specificities of CRISPR-Cas9 a...
2,1624645050990329856,"[Cachexia, the unintentional loss of body weig...","[Justin P. Hardee, Brittany R. Counts, James A...",1580572658400,[PMC_FULLTEXT],"[PMID:30627079, DOI:10.1177/1559827617725283, ...",PMCID:PMC6311610,Am J Lifestyle Med,[EN],Am J Lifestyle Med,[Understanding the Role of Exercise in Cancer ...
3,1624644678843367424,[Meniscus injuries are among the most common o...,"[Chathuraka T. Jayasuriya, John Twomey-Kozak, ...",1580572303495,[PMC_FULLTEXT],"[PMID:30358021, DOI:10.1002/stem.2923, MANUSCR...",PMCID:PMC6312732,Stem Cells,[EN],Stem Cells,[Human cartilage-derived progenitors resist te...
4,1624644414228922368,[Perivascular accumulation of lymphocytes can ...,"[Alexander M. S. Barron, Julio C. Mantero, Jon...",1580572051111,[PMC_FULLTEXT],"[PMID:30510068, DOI:10.4049/jimmunol.1801209, ...",PMCID:PMC6305793,J. Immunol.,[EN],J. Immunol.,[Perivascular Adventitial Fibroblast Specializ...
5,1624644053385609216,[Human embryonic stem cell-derived cardiomyocy...,"[Ana De La Mata, Sendoa Tajada, Samantha O’Dwy...",1580571707008,[PMC_FULLTEXT],"[PMID:30353632, DOI:10.1002/stem.2927, MANUSCR...",PMCID:PMC6312737,Stem Cells,[EN],Stem Cells,[BIN1 induces the formation of T-tubules and a...
6,1624631335091961856,[About 1 in 5 child deaths is a result of unin...,"[Ann Dellinger, Julie Gilchrist]",1580559577896,[PMC_FULLTEXT],"[PMID:28845146, DOI:10.1177/1559827617696297, ...",PMCID:PMC5568777,Am J Lifestyle Med,[EN],Am J Lifestyle Med,[Leading Causes of Fatal and Nonfatal Unintent...
7,1624643963233239040,[[No abstract available]],"[Heather R. McGregor, Joshua G.A. Cashaback, P...",1578066021026,[PMC_FULLTEXT],"[PMID:30513324, DOI:10.1016/j.cub.2018.11.043,...",PMCID:PMC6317725,Curr. Biol.,[EN],Curr. Biol.,[Functional Plasticity in Somatosensory Cortex...
8,1624637987533881344,[Although recent declines in life expectancy a...,"[Peter A. Muennig, Megan Reynolds, David S. Fi...",1577887522164,[PMC_FULLTEXT],"[PMID:30252522, DOI:10.2105/AJPH.2018.304585, ...",PMCID:PMC6221922,Am J Public Health,[EN],Am J Public Health,"[America’s Declining Well-Being, Health, and L..."
9,1624637779106332672,[Many bacterial species use the MecA/ClpCP pro...,"[M. Son, J. Kaspar, S.J. Ahn, R.A. Burne, S.J....",1577887323393,[PMC_FULLTEXT],"[PMID:29873131, DOI:10.1111/mmi.13992, MANUSCR...",PMCID:PMC6281771,Mol. Microbiol.,[EN],Mol. Microbiol.,[Threshold regulation and stochasticity from t...


**to do**: filter documents for relevant sentences and display them ...

goal to classify compounds into [Agents Classified by the IARC Monographs](https://monographs.iarc.fr/agents-classified-by-the-iarc/) :
* Group 1 - Carcinogenic to humans
* Group 2A - Probably carcinogenic to humans
* Group 2B - Possibly carcinogenic to humans
* Group 3 - Not classifiable as to its carcinogenicity to humans



In [24]:
fetchDocument('PMCID:PMC2018698', 'PMC_2019')['documentElement']['metaElement']


{'bibliographic': {'documentAbstract': {'abstractSections': [{'paragraphs': [{'sentences': None,
       'structureElements': [{'captionedBox': None,
         'code': None,
         'dataTable': None,
         'figure': None,
         'formula': None,
         'imageContent': None,
         'outline': None,
         'quotation': None,
         'table': None,
         'textElement': {'text': 'Use of phenacetin and other analgesic and non-steroidal anti-inflammatory drugs (NSAIDs) potentially influences bladder cancer incidence, but epidemiologic evidence is limited.',
          'uuid': '024b0114-ee64-488f-ab6c-95ded6948e76',
          'annotations': None},
         'sentence': None,
         'list': None}]}],
     'depth': 1,
     'rhetorical': {'text': 'Background',
      'uuid': 'af2be0c4-d4c6-46f3-9190-51f9e51bf41f',
      'annotations': None},
     'title': {'text': 'Background',
      'uuid': '4dda2ec6-b92e-4959-913c-23d5dadaa957',
      'annotations': None}},
    {'paragraphs': [{'