# DataCure Text Mine Workflow
**Author**: Marc Jacobs

This notebook outlines the implementation of a `search for compound information from the literature`. Workflow :

1. authenticate
1. find proper concept to text mine ( [OLS](https://www.ebi.ac.uk/ols/docs/api), [TeMOwl](http://bart.scai.fraunhofer.de:9090/swagger-ui.html#/) )
1. find proper documents containing that concept ( [SCAIView](http://api.scaiview.com) )
1. further analyze documents with NLP ( SCAIView -> UIMA )

## Notebook Imports

In [1]:
import simplejson as json
import requests
import urllib
from requests.exceptions import HTTPError
from IPython.core.display import display, HTML
import pandas as pd

some Fraunhofer specific configurations

> **to be discussed**
> * which services should be deployed in the ORN infrastructure?
> * who is the contact there?
> * where to place the docker files / containers?

In [2]:
keycloak_uri = 'http://keycloak.scai.fraunhofer.de/auth/realms/SCAI-Bio/'
temowl_uri   = 'http://bart.scai.fraunhofer.de:9090/'
scaiview_uri = 'http://api.scaiview.com/api/'

## Step 1 - fetch security token

as a start we use the _keycloak server_ from Fraunhofer SCAI.

> **to be discussed** with the infrastructure team
> * how to use ORN authentication?
> * who is the contact there?

In [3]:
response = requests.post(
    keycloak_uri+'protocol/openid-connect/token', 
    data={'grant_type':'client_credentials', 'client_id':'temowl-backend', 'client_secret':'6f2879cd-0d00-457b-9909-abc78f374d73'}
)

token = response.json()['access_token']
token

'eyJhbGciOiJSUzI1NiIsInR5cCIgOiAiSldUIiwia2lkIiA6ICJudnV2UGU2c000TVQyOWFRNkV4dGI5TldoVFJramZWYmcwQTNUVTZNSFpzIn0.eyJqdGkiOiIzOTgwZGNiNS1lZjM5LTQ3Y2QtYjdlZS00OWU2ZDExNjQyNDIiLCJleHAiOjE1NTEzNTkzNTksIm5iZiI6MCwiaWF0IjoxNTUxMzQ4NTU5LCJpc3MiOiJodHRwOi8va2V5Y2xvYWsuc2NhaS5mcmF1bmhvZmVyLmRlL2F1dGgvcmVhbG1zL1NDQUktQmlvIiwiYXVkIjoidGVtb3dsLWJhY2tlbmQiLCJzdWIiOiI1NTg0YTA4Yi1lNmUxLTRlMWEtOTg2Yy1jODUyNDNiZWNlZDEiLCJ0eXAiOiJCZWFyZXIiLCJhenAiOiJ0ZW1vd2wtYmFja2VuZCIsImF1dGhfdGltZSI6MCwic2Vzc2lvbl9zdGF0ZSI6IjMyMDI4YTQ1LTkzNmYtNDE5ZC04MDY4LWU3MDIyYzVkOWJmZCIsImFjciI6IjEiLCJhbGxvd2VkLW9yaWdpbnMiOltdLCJyZWFsbV9hY2Nlc3MiOnsicm9sZXMiOlsib2ZmbGluZV9hY2Nlc3MiLCJ1bWFfYXV0aG9yaXphdGlvbiIsInVzZXIiXX0sInJlc291cmNlX2FjY2VzcyI6eyJyZWFsbS1tYW5hZ2VtZW50Ijp7InJvbGVzIjpbInZpZXctdXNlcnMiLCJxdWVyeS1ncm91cHMiLCJxdWVyeS11c2VycyJdfSwiYWNjb3VudCI6eyJyb2xlcyI6WyJtYW5hZ2UtYWNjb3VudCIsIm1hbmFnZS1hY2NvdW50LWxpbmtzIiwidmlldy1wcm9maWxlIl19fSwic2NvcGUiOiJwcm9maWxlIGVtYWlsIiwiY2xpZW50SWQiOiJ0ZW1vd2wtYmFja2VuZCIsImNsaWVudEhvc3QiOiI

## Step 2 - get concept for Acetaminophen

search for concept of `Acetaminophen` aka `Paracetamol` in `chebi`

In [4]:
response = requests.get(
    temowl_uri+'terminologies/chebi/concepts/search?',
    params={'exact': 'false', 'q': 'Paracetamol', 'lang': 'en', 'page': '0', 'size': '10'},
    headers={'Accept': 'application/json'},
)

json_response = response.json()
acetaminophen = json_response['content'][0]['conceptID']
acetaminophen

'chebi:46195'

got it: `Acetaminophen` is `chebi:46195`.

now retrieve details on `Paracetamol`:

In [5]:
response = requests.get(
    temowl_uri+'terminologies/chebi/concepts/'+urllib.parse.quote(acetaminophen),
    params={'lang': 'en', 'page': '0', 'size': '10'},
    headers={'Accept': 'application/json'},
)
json_response = response.json()
json_response['description']['description']

'A member of the class of phenols that is 4-aminophenol in which one of the hydrogens attached to the amino group has been replaced by an acetyl group.'

fetch some synonyms in order to expand the search query

In [6]:
response = requests.get(
    temowl_uri+'terminologies/chebi/concepts/'+urllib.parse.quote(acetaminophen)+'/labels',
    params={'lang': 'en', 'page': '0', 'size': '10'},
    headers={'Accept': 'application/json'},
)

full_text_query = '( Acetaminophen'

json_response = response.json()
for synonym in json_response['content']:
    full_text_query += ' OR ' + synonym['name']
    
full_text_query += ' )'
full_text_query

'( Acetaminophen OR Paracetamol OR N-(4-hydroxyphenyl)acetamide OR paracetamol )'

## Step 3 - get documents on Acetaminophen

> **to be discussed** - here we are using the different Fraunhofer pre-indexed document collections
> * which document collections does ToxPlanet have?
> * how are they indexed and searchable?
> * how do we access them?
> * who is the technical contact there?

### define some helper functions

fetch a pmid and render HTML

In [7]:
def fetchDocumentAndRenderHTML( pmid, collection ):
   "fetch document with pmid and return HTML"
   response = requests.get(
        scaiview_uri+'v3/fetch/' + urllib.parse.quote(pmid)+'?',
        params={'collection': collection},
        headers={'Accept': 'text/html', 'Authorization': 'Bearer '+token},
   )
   return HTML(response.text)

def fetchDocument( pmid, collection ):
   "fetch document with pmid and return HTML"
   response = requests.get(
        scaiview_uri+'v3/fetch/' + urllib.parse.quote(pmid)+'?',
        params={'collection': collection},
        headers={'Accept': 'application/json', 'Authorization': 'Bearer '+token},
   )
   return response.json()

search for documents

In [8]:
def querySCAIView( query, collection, limit = 10 ):
    "fetch a list of documents matching the query, return json"

    encoded_query = urllib.parse.quote(query)
        
    response = requests.get(
        scaiview_uri+'v2/solr/search?',
        params={'q':encoded_query, 'rows':limit, 'sortField':'date', 'sortOrder':'DESC', 'collection':collection},
        headers={'Accept': 'application/json', 'Authorization': 'Bearer '+token},
    )
    
    return response.json()
    

def querySCAIViewAsDocuments( query, collection, limit = 10 ):
    "fetch a list of documents matching the query"
    
    json_response = querySCAIView( query, collection, limit )

    print('{}: {} found'.format(query, json_response['numFound']))
    return json_response['documents']


def querySCAIViewAsPD( query, collection, limit = 10 ):
    "fetch a list of documents matching the query, return as pd"

    json_response =  querySCAIView( query, collection, limit )
    
    documents = json_response['documents']
    numFound = json_response['numFound']
    
    print('{}: {} found'.format(query, numFound))

    df = pd.DataFrame([], columns = ['PMID','Title'])

    for x in range(0, min(len(documents),numFound-1)):
        df = df.append(pd.DataFrame([[documents[x]['id'][5:], documents[x]['title'][0]]], index = [x+1], columns = ['PMID','Title']))

    return df

### query for documents for Acetaminophen


*recall based* - with free text search

In [9]:
df = querySCAIViewAsPD(full_text_query + ' AND carcinogen*', 'Medline_2019', 10)
df.sort_values(by=['PMID'], ascending=False)

( Acetaminophen OR Paracetamol OR N-(4-hydroxyphenyl)acetamide OR paracetamol ) AND carcinogen*: 2287595 found


Unnamed: 0,PMID,Title
3,30640403,[Recurrence and chronicity of major depressive...
4,30640402,[Electroconvulsion therapy for persistent depr...
2,30572411,The Effect of Lower Transaction Costs on Socia...
1,30572410,Immigration Enforcement and Children's Living ...
5,30103352,Medication reconciliation and review for older...
6,29169162,Application of Electrocautery Needle Knife Com...
7,29166637,Assessing Vessel Tone during Coronary Artery S...
8,29045949,Effectiveness of a Home-Based Active Video Gam...
9,28881352,Current Practice of Airway Stenting in the Adu...
10,28864973,The Role of Oxytocin in Social Buffering: What...


*precision based* - using indexed substances field

In [10]:
df = querySCAIViewAsPD('Substances:Carcinogens AND Substances:Acetaminophen', 'Medline_2019', 10)
df.sort_values(by=['PMID'], ascending=False)

Substances:Carcinogens AND Substances:Acetaminophen: 146138 found


Unnamed: 0,PMID,Title
5,30589062,"Occupational risk perception, stressors and st..."
3,30568084,[Study on the Establishment of a Specific Simi...
8,30288716,Supercooling-Promoting (Anti-ice Nucleation) S...
4,30272428,What Interventions Work Best for Families Who ...
7,30199073,[Current trends of the choice and processing o...
6,30199059,[The in vitro examination of the effectiveness...
10,30176634,[Ultramicroscopic features of cells and vessel...
9,29895139,Methods for the initial (non-laboratory) asses...
2,28466189,"A Review of the Environmental Degradation, Eco..."
1,28444578,Experimental Psychosis Research and Schizophre...


### Step 4 - mine full text documents

In [11]:
documents = querySCAIViewAsDocuments(full_text_query + ' AND carcinogen*', 'PMC_2019', 10)
documents

( Acetaminophen OR Paracetamol OR N-(4-hydroxyphenyl)acetamide OR paracetamol ) AND carcinogen*: 2031480 found


[{'id': 'PMCID:PMC6318039',
  'title': ['Epidemiology of prediabetes and diabetes in Namibia, Africa: a multilevel analysis'],
  'abstract': ['Diabetes is a leading cause of progressive morbidity and early mortality worldwide. Little is known on the burden of diabetes and pre-diabetes in Namibia, a Sub-Saharan African (SSA) country that is undergoing a demographic transition.We estimated the prevalence and correlates of diabetes (defined as fasting [capillary] blood glucose [FBG]>126 mg/dL) and prediabetes (defined by World Health Organization [WHO] and American Diabetes Association [ADA] criteria [FBG 110–125 mg/dL and 100–125 mg/dL, respectively]) in a random sample of 3278 participants aged 35–64 from the 2013 Namibia Demographic and Health Survey.The prevalence of diabetes was 5.1% (95% Confidence Interval [CI]: 4.2–6.2), with no evidence of gender differences (p=0.45). The prevalence of prediabetes was 6.8% (5.8–8.0) and 20.1% (18.4–21.9) using WHO and ADA criteria, respectively. 

In [12]:
fetchDocumentAndRenderHTML('PMCID:PMC6319626', 'PMC_2019')

In [13]:
fetchDocument('PMCID:PMC6319626', 'PMC_2019')

{'provenance': {'license': None,
  'version': '1.0',
  'source': 'file:/home/bio/groupshare/library/PMC/manuscript_2019/006/PMC006XXXXXX.xml.tar.gz',
  'date': 1549370953740,
  'collection': 'PMC_2019',
  'comments': None},
 'documentElement': {'metaElement': {'bibliographic': {'documentAbstract': {'abstractSections': [{'paragraphs': [{'sentences': None,
         'structureElements': [{'captionedBox': None,
           'code': None,
           'dataTable': None,
           'figure': None,
           'formula': None,
           'imageContent': None,
           'outline': None,
           'quotation': None,
           'table': None,
           'textElement': {'text': 'Cytochrome P4502E1 (CYP2E1) is involved in the biotransformation of several low molecular weight chemicals and plays an important role in the metabolic activation of carcinogens and hepatotoxins such as CCl',
            'uuid': 'dcd86d65-c171-4b39-be75-ec59424effa6',
            'annotations': None},
           'sentence': 

In [14]:
data = pd.io.json.json_normalize(documents)
data

Unnamed: 0,_version_,abstract,authors,date,docType,documentIdentifiers,id,journal,language,source,title
0,1624645317434540032,[Diabetes is a leading cause of progressive mo...,"[Victor T. ADEKANMBI, Olalekan A. UTHMAN, Sebh...",1583078512499,[PMC_FULLTEXT],"[PMID:30058263, DOI:10.1111/1753-0407.12829, M...",PMCID:PMC6318039,J Diabetes,[EN],J Diabetes,[Epidemiology of prediabetes and diabetes in N...
1,1624645043740475392,[CRISPR-Cas9 and Cas12a (Cpf1) nucleases are t...,"[Keunsub Lee, Yingxiao Zhang, Benjamin P. Klei...",1583078251468,[PMC_FULLTEXT],"[PMID:29972722, DOI:10.1111/pbi.12982, MANUSCR...",PMCID:PMC6320322,Plant Biotechnol. J.,[EN],Plant Biotechnol. J.,[Activities and specificities of CRISPR-Cas9 a...
2,1624645050990329856,"[Cachexia, the unintentional loss of body weig...","[Justin P. Hardee, Brittany R. Counts, James A...",1580572658400,[PMC_FULLTEXT],"[PMID:30627079, DOI:10.1177/1559827617725283, ...",PMCID:PMC6311610,Am J Lifestyle Med,[EN],Am J Lifestyle Med,[Understanding the Role of Exercise in Cancer ...
3,1624644678843367424,[Meniscus injuries are among the most common o...,"[Chathuraka T. Jayasuriya, John Twomey-Kozak, ...",1580572303495,[PMC_FULLTEXT],"[PMID:30358021, DOI:10.1002/stem.2923, MANUSCR...",PMCID:PMC6312732,Stem Cells,[EN],Stem Cells,[Human cartilage-derived progenitors resist te...
4,1624644414228922368,[Perivascular accumulation of lymphocytes can ...,"[Alexander M. S. Barron, Julio C. Mantero, Jon...",1580572051111,[PMC_FULLTEXT],"[PMID:30510068, DOI:10.4049/jimmunol.1801209, ...",PMCID:PMC6305793,J. Immunol.,[EN],J. Immunol.,[Perivascular Adventitial Fibroblast Specializ...
5,1624644053385609216,[Human embryonic stem cell-derived cardiomyocy...,"[Ana De La Mata, Sendoa Tajada, Samantha O’Dwy...",1580571707008,[PMC_FULLTEXT],"[PMID:30353632, DOI:10.1002/stem.2927, MANUSCR...",PMCID:PMC6312737,Stem Cells,[EN],Stem Cells,[BIN1 induces the formation of T-tubules and a...
6,1624631335091961856,[About 1 in 5 child deaths is a result of unin...,"[Ann Dellinger, Julie Gilchrist]",1580559577896,[PMC_FULLTEXT],"[PMID:28845146, DOI:10.1177/1559827617696297, ...",PMCID:PMC5568777,Am J Lifestyle Med,[EN],Am J Lifestyle Med,[Leading Causes of Fatal and Nonfatal Unintent...
7,1624643963233239040,[[No abstract available]],"[Heather R. McGregor, Joshua G.A. Cashaback, P...",1578066021026,[PMC_FULLTEXT],"[PMID:30513324, DOI:10.1016/j.cub.2018.11.043,...",PMCID:PMC6317725,Curr. Biol.,[EN],Curr. Biol.,[Functional Plasticity in Somatosensory Cortex...
8,1624639015276773376,[Mice with disruption of Lrrk1 and patients wi...,"[Mingjue Si, Helen Goodluck, Canjun Zeng, Song...",1577888502292,[PMC_FULLTEXT],"[PMID:30136304, DOI:10.1002/jcb.27377, MANUSCR...",PMCID:PMC6218268,J. Cell. Biochem.,[EN],J. Cell. Biochem.,[LRRK1 Regulation of Actin Assembly in Osteocl...
9,1624638669631520768,[Mitochondrial function requires the coordinat...,"[David M. Rand, Jim A. Mossman, Lei Zhu, Leann...",1577888172659,[PMC_FULLTEXT],"[PMID:30394643, DOI:10.1002/iub.1954, MANUSCRI...",PMCID:PMC6268205,IUBMB Life,[EN],IUBMB Life,"[Mitonuclear epistasis, genotype-by-environmen..."


**to do**: filter documents for relevant sentences and display them ...