# DataCure Text Mine Workflow
**Author**: Marc Jacobs, [Fraunhofer SCAI](https://www.scai.fraunhofer.de/en.html) (Mar 2019)

This notebook outlines a workflow that searches the literature for compound information. The steps demonstrated are:

1. authenticate ([keycloak](http://keycloak.scai.fraunhofer.de/auth/realms/SCAI-Bio/))
1. find proper concept to text mine ( [OLS](https://www.ebi.ac.uk/ols/docs/api), [TeMOwl](http://bart.scai.fraunhofer.de:9090/swagger-ui.html#/) )
1. find proper documents containing that concept ( [SCAIView](http://api.scaiview.com) )
1. further analyze documents with NLP ( SCAIView -> UIMA )

## Notebook Imports

In [2]:
import simplejson as json
import requests
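import urllib.parse  # urllib.parse.quote is used below; a bare "import urllib" does not guarantee the submodule is loaded
from requests.exceptions import HTTPError
from IPython.display import display, HTML  # importing these names from IPython.core.display is deprecated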
import pandas as pd
import getpass

Some Fraunhofer-specific configuration of the APIs used in this session:

* an authentication service
* an identifier and mapping service
* a retrieval and text mining service

All of them are currently hosted at SCAI. The containers will either be deployed in the OpenRiskNet (ORN) infrastructure or be replaced by services within ORN.

In [3]:
keycloak_uri = 'https://keycloak.scai.fraunhofer.de/auth/realms/SCAI-bio/'
temowl_uri   = 'https://sam1.api.scaiview.com/'
scaiview_uri = 'https://api.scaiview.com/api/v5/'
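
The helper functions later in this notebook assemble endpoint URLs by concatenating strings and quoting identifiers such as `PMID:26551927`. The following is a minimal sketch of that pattern as one reusable function. `build_endpoint` is a hypothetical name introduced here for illustration and is not part of any SCAIView client; the path segments are the ones used by the calls in this notebook.

```python
from urllib.parse import quote

def build_endpoint(base_uri, *segments):
    """Join a base URI and path segments, percent-encoding each segment.

    Hypothetical helper for illustration; it only mirrors the string
    concatenation used by the notebook's own functions.
    """
    # safe='' also encodes '/', so an identifier cannot add extra path levels
    encoded = [quote(str(segment), safe='') for segment in segments]
    return base_uri.rstrip('/') + '/' + '/'.join(encoded)

example = build_endpoint(
    'https://api.scaiview.com/api/v5/',
    'corpora', 'academia_2019_v_1_0_1', 'documents', 'PMID:26551927',
)
print(example)
```

The colon in the document identifier is encoded as `%3A`, which matches what `urllib.parse.quote` does in the notebook's own helper functions.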

## Step 1 - fetch security token

All API calls are secured via OAuth 2, so a security token has to be passed along with each request. <br>
As a start we use the _keycloak server_ from Fraunhofer SCAI (this will be replaced in the near future with the SSO of OpenRiskNet). At the moment we send a username and password together with a _client id_ (OAuth 2 password grant) in order to request a valid token.



In [5]:
username = 'marc.jacobs@scai.fraunhofer.de'
password = getpass.getpass('Password:')


Password:········


In [131]:
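response = requests.post(
    keycloak_uri + 'protocol/openid-connect/token',
    # Content-Type is an HTTP header, not a form field (requests would also set it automatically for a dict body)
    headers={'Content-Type': 'application/x-www-form-urlencoded'},
    data={'username': username, 'password': password, 'grant_type': 'password', 'client_id': 'curl'},
)

print(response)
response.raise_for_status()  # stop here if authentication failed
token = 'Bearer ' + response.json()['access_token']
print(token[0:25] + '...')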


<Response [200]>
Bearer eyJhbGciOiJSUzI1Ni...
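
Access tokens are short-lived, so a longer session may need to request a new one. The sketch below shows one way to keep track of that, assuming the token response carries the standard OAuth 2 fields `access_token` and `expires_in`. The function names and the sample payload are made up for illustration; no request is sent.

```python
import time

def authorization_header(token_response, issued_at=None):
    """Build an Authorization header value and an absolute expiry time.

    Assumes the standard OAuth 2 token response fields 'access_token'
    and 'expires_in'; the sample payload below is made up.
    """
    if 'access_token' not in token_response:
        raise ValueError('no access_token in token response')
    issued_at = time.time() if issued_at is None else issued_at
    expires_at = issued_at + token_response.get('expires_in', 0)
    return 'Bearer ' + token_response['access_token'], expires_at

def is_expired(expires_at, now=None, leeway=10):
    """Return True if the token expires within `leeway` seconds."""
    now = time.time() if now is None else now
    return now >= expires_at - leeway

sample = {'access_token': 'abc123', 'expires_in': 300}  # made-up values
header, expires_at = authorization_header(sample, issued_at=1000.0)
print(header, is_expired(expires_at, now=1100.0))
```

A caller could check `is_expired(expires_at)` before each API call and repeat the token request from the cell above when it returns `True`.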


## Step 2 - get concept for Acetaminophen

search for concept of `Acetaminophen` aka `Paracetamol` in `chebi`

In [7]:
response = requests.get(
    temowl_uri+'terminologies/chebi/concepts/search?',
    params={'exact': 'false', 'q': 'Paracetamol', 'lang': 'en', 'page': '0', 'size': '10'},
    headers={'Accept': 'application/json'},
)

json_response = response.json()
acetaminophen = json_response['content'][0]['conceptID']
acetaminophen

'chebi:46195'

got it: `Acetaminophen` is `chebi:46195`.

now retrieve details on `Paracetamol`:

In [8]:
response = requests.get(
    temowl_uri+'terminologies/chebi/concepts/'+urllib.parse.quote(acetaminophen),
    params={'lang': 'en', 'page': '0', 'size': '10'},
    headers={'Accept': 'application/json'},
)
json_response = response.json()
json_response['description']['description']


'A member of the class of phenols that is 4-aminophenol in which one of the hydrogens attached to the amino group has been replaced by an acetyl group.'

fetch some synonyms in order to expand the search query

In [9]:
response = requests.get(
    temowl_uri+'terminologies/chebi/concepts/'+urllib.parse.quote(acetaminophen)+'/labels',
    params={'lang': 'en', 'page': '0', 'size': '10'},
    headers={'Accept': 'application/json'},
)

full_text_query = '( Acetaminophen'

json_response = response.json()
for synonym in json_response['content']:
    if '(' not in synonym['name'] :
        full_text_query += ' OR ' + synonym['name']
    
full_text_query += ' )'
full_text_query

'( Acetaminophen OR Paracetamol OR paracetamol )'
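
The resulting query contains both `Paracetamol` and `paracetamol`. If the search backend ignores case (an assumption, not verified here), such duplicates add nothing. A minimal sketch of the same query construction with case-insensitive de-duplication follows; `build_or_query` and the label list are hypothetical.

```python
def build_or_query(preferred_label, synonyms):
    """Join a label and its synonyms into '( a OR b ... )'.

    Names containing '(' are skipped, as in the cell above, and
    case-insensitive duplicates are dropped.
    """
    terms = [preferred_label]
    seen = {preferred_label.lower()}
    for name in synonyms:
        if '(' in name or name.lower() in seen:
            continue
        seen.add(name.lower())
        terms.append(name)
    return '( ' + ' OR '.join(terms) + ' )'

# made-up label list for illustration
labels = ['Paracetamol', 'paracetamol', 'N-(4-hydroxyphenyl)acetamide', 'acetaminophen']
deduplicated_query = build_or_query('Acetaminophen', labels)
print(deduplicated_query)
```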

## Step 3 - get documents on Acetaminophen

> **to be discussed** - here we are using the different Fraunhofer pre-indexed document collections, we could tap into your collections
> * which document collections do you have?
> * how are they indexed and searchable?
> * how do we access them?
> * who is the technical contact there?

### define some helper functions

fetch a pmid and render HTML

In [12]:
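def fetchDocumentAndRenderHTML( documentID, corpusID ):
    "fetch document with pmid and return HTML"

    uri = scaiview_uri+'corpora/' + urllib.parse.quote(corpusID) + '/documents/' + urllib.parse.quote(documentID)

    try:
        response = requests.get(
            uri,
            headers={'Accept': 'text/html', 'Authorization': token},
        )
        response.raise_for_status()  # treat HTTP error codes as failures
        return HTML(response.text)

    except requests.exceptions.RequestException as error:
        # 'response' may not exist if the request itself failed, so report the exception instead
        print(uri, error)
        return HTML("empty")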

search for documents

In [10]:
def fetchCorpusID( corpusName ):
    "fetch a corpus id via name"

    uri = scaiview_uri+'corpora/'
    corpusId = ''

    try:
        response = requests.get(
            uri,
            params={'size': '1000', 'page':'0'},
            headers={'Accept': '*/*', 'Authorization': token},
        )
        response.raise_for_status()

        # keep the id of the last corpus whose name matches
        for corpus in response.json()['content']:
            if corpus['name'] == corpusName:
                corpusId = corpus['id']

        print(corpusId)
        return corpusId

    except (requests.exceptions.RequestException, KeyError, ValueError) as error:
        # 'response' may not exist if the request itself failed, so report the exception instead
        print(uri, error)
        return ""

In [11]:
corpusID = fetchCorpusID('academia')

academia_2019_v_1_0_1


In [64]:
def fulltextQuery(keywords =['human', 'cancer'], recall =True ):
    "create a fulltext query"
    
    if recall==True:
        operator = "OR"
    else:
        operator = "AND"
        
    query = {
        "operator": operator,
        "searchedFulltext": keywords
    }
    
    return query

In [106]:
def semanticQuery(concepts =['mesh:D002555','mesh:D001921'], recall =True ):
    "create a semantic query"
    
    if recall==True:
        operator = "OR"
    else:
        operator = "AND"
        
    query = {
        "operator": operator,
        "searchedConcepts": concepts
    }
    
    return query

In [109]:
def mixedQuery(keywords =['human', 'cancer'], concepts =['mesh:D002555','mesh:D001921'], recall =True ):
    "create a mixed query"
    
    if recall==True:
        operator = "OR"
    else:
        operator = "AND"
        
    query = {
        "operator": operator,
        "searchedFulltext": keywords,
        "searchedConcepts": concepts
    }
    
    return query
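
The three builders above differ only in which fields they set. As a sketch, they could be folded into one function that omits empty fields. The field names `operator`, `searchedFulltext` and `searchedConcepts` are taken from the functions above; `build_query` itself is a hypothetical name, and whether the API accepts a payload with one of the fields missing is assumed from the first two builders.

```python
import json

def build_query(keywords=None, concepts=None, recall=True):
    """Create a fulltext, semantic or mixed query payload.

    recall=True joins the terms with OR (more hits),
    recall=False joins them with AND (fewer, more specific hits).
    """
    if not keywords and not concepts:
        raise ValueError('need at least one keyword or concept')
    query = {'operator': 'OR' if recall else 'AND'}
    if keywords:
        query['searchedFulltext'] = list(keywords)
    if concepts:
        query['searchedConcepts'] = list(concepts)
    return query

mixed = build_query(['carcinogen', 'human'], ['chebi:2386'], recall=False)
print(json.dumps(mixed))
```

Passing only `keywords` reproduces the payload of `fulltextQuery`, and passing only `concepts` reproduces that of `semanticQuery`.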

In [87]:
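def fetchDocumentsByQuery( query, corpusID, limit='10' ):
    "fetch documents by query return json array"

    url = scaiview_uri+'corpora/'+urllib.parse.quote(corpusID)+'/search/documents?size='+str(limit)
    payload = json.dumps(query)
    headers = {'Accept': '*/*', 'Authorization': token,  'Content-Type': 'application/json'}

    print(url)

    try:
        r = requests.post(url, data=payload, headers=headers)
        r.raise_for_status()
        documents = r.json()

        print ('got: ' , len(documents['content']), ' documents')
        return documents

    except (requests.exceptions.RequestException, ValueError, KeyError) as error:
        # 'r' may not exist if the request itself failed, so report the exception instead
        print('error', error)
        print(payload)
        # an empty result keeps callers that read ['content'] working
        return {'content': []}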

In [16]:
def getTitleOfDocument(documents, i=0):
    "return the title of a document in a result list"
    
    return documents['content'][i]['documentElement']['metaElement']['bibliographic']['title']['titleText']['text']

In [56]:
def getIdOfDocument(documents, i=0):
    "return the id of a document in a result list"

    return documents['content'][i]['documentElement']['metaElement']['concept']['identifier']['text']
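
Both accessors assume that every level of the nested response exists and raise a `KeyError` otherwise. The sketch below shows a more tolerant lookup on a toy response. The key paths are copied from the two functions above; the helper `dig` and all values in the toy response are made up.

```python
def dig(mapping, *keys, default=None):
    """Walk nested dicts/lists and return `default` if any step is missing."""
    current = mapping
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current

TITLE_PATH = ('documentElement', 'metaElement', 'bibliographic', 'title', 'titleText', 'text')
ID_PATH = ('documentElement', 'metaElement', 'concept', 'identifier', 'text')

# Toy response in the shape the accessors above expect; values are made up.
toy_response = {'content': [
    {'documentElement': {'metaElement': {
        'bibliographic': {'title': {'titleText': {'text': 'A toy title'}}},
        'concept': {'identifier': {'text': '12345'}},
    }}},
    {'documentElement': {'metaElement': {}}},  # record without title or id
]}

rows = [
    (dig(doc, *ID_PATH), dig(doc, *TITLE_PATH, default='(no title)'))
    for doc in toy_response['content']
]
print(rows)
```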

In [141]:
def fetchDocumentsByQueryAsPD( query, collection, limit = '10', sorted=True ):
    "fetch a list of documents matching the query, return as pd"

    documents = fetchDocumentsByQuery( query, collection, limit )
    rows = []

    try:
        for x in range(0, len(documents['content'])):
            rows.append([getIdOfDocument(documents,x), getTitleOfDocument(documents,x)])
    except (KeyError, IndexError, TypeError):
        # unexpected response shape: show it and keep what was collected so far
        print(documents)

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step with a 1-based index
    df = pd.DataFrame(rows, index=range(1, len(rows)+1), columns=['PMID','Title'])

    if sorted==True:
        df = df.sort_values(by=['Title'], ascending=True)

    return df

Deprecated helper, still called in Step 4 below:


In [7]:
def textMineDocument(ids, source, target, queue):
    "fetch document with pmid and send it to queue to be annotated"
    response = requests.post(
        scaiview_uri+'v3/fetchAndQueue?',
        params={'ids': ids, 'sourceCollection': source, 'targetCollection': target, 'targetQueue': queue },
        # token already carries the 'Bearer ' prefix (see Step 1)
        headers={'Accept': 'application/json', 'Authorization': token},
    )
    return response


### query for documents for Acetaminophen


*precision based* - with free text search

In [152]:
query = fulltextQuery(['acetaminophen', 'IARC'], False)
df = fetchDocumentsByQueryAsPD( query, corpusID, '250')
df

https://api.scaiview.com/api/v5/corpora/academia_2019_v_1_0_1/search/documents?size=250
got:  2  documents


| | PMID | Title |
|---|---|---|
| 1 | 26551927 | A quantum chemical study of the reactivity of ... |
| 2 | 10755416 | Non-steroidal anti-inflammatory drugs and blad... |


In [153]:
fetchDocumentAndRenderHTML('PMID:'+df.loc[2, 'PMID'], corpusID)


In [150]:
query = fulltextQuery(['acetaminophen', 'carcinogen', 'human'], False)
df = fetchDocumentsByQueryAsPD( query, corpusID, '250')
df


https://api.scaiview.com/api/v5/corpora/academia_2019_v_1_0_1/search/documents?size=250
got:  8  documents


| | PMID | Title |
|---|---|---|
| 5 | 12570342 | Anticarcinogenicity of monocyclic phenolic com... |
| 4 | 15120964 | Comparison of basal gene expression profiles a... |
| 8 | 1910589 | Critical considerations in the immunochemical ... |
| 3 | 23723564 | Development of a Medium-term Animal Model Usin... |
| 1 | 23894158 | Highly selective bioactivation of 1- and 2-hyd... |
| 6 | 9370098 | Inhibition by acetaminophen of intestinal canc... |
| 7 | 7866987 | Potential genoprotective role for UDP-glucuron... |
| 2 | 23479080 | Selective poisoning of Ctnnb1-mutated hepatoma... |


In [146]:
fetchDocumentAndRenderHTML('PMID:'+df.loc[1, 'PMID'], corpusID)


In [148]:
query = fulltextQuery(['Acetaminophen', 'cancer', 'human'], False)
df = fetchDocumentsByQueryAsPD( query, corpusID, '250')
df

https://api.scaiview.com/api/v5/corpora/academia_2019_v_1_0_1/search/documents?size=250
got:  51  documents


| | PMID | Title |
|---|---|---|
| 40 | 9928667 | Acetaminophen alters estrogenic responses in v... |
| 9 | 23526585 | Acetaminophen attenuates doxorubicin-induced c... |
| 31 | 11397560 | Acetaminophen elicits anti-estrogenic but not ... |
| 10 | 23749887 | Acetaminophen enhances cisplatin- and paclitax... |
| 30 | 12044058 | Acetaminophen modulations of chemotherapy effi... |
| 36 | 10902853 | Acetaminophen selectively reduces glioma cell ... |
| 13 | 21371442 | Acetaminophen-induced differentiation of human... |
| 38 | 10079213 | Acetaminophen-induced proliferation of breast ... |
| 33 | 11192938 | Anti-microinflammatory lipid signals generated... |
| 17 | 19306902 | Anti-oxidants for therapeutic use: why are onl... |


*precision based* - using indexed MeSH field

In [143]:
query = semanticQuery(['mesh:D000082', 'mesh:D002273'], False)
df = fetchDocumentsByQueryAsPD( query, corpusID, '250')
df

https://api.scaiview.com/api/v5/corpora/academia_2019_v_1_0_1/search/documents?size=250
got:  8  documents


| | PMID | Title |
|---|---|---|
| 2 | 1902451 | Age-dependent induction of preneoplastic liver... |
| 7 | 853543 | Covalent binding of foreign chemicals to tissu... |
| 8 | 1000507 | Effect of hepatocarcinogens on the binding of ... |
| 5 | 2477123 | Effects of tumor promoters, genotoxic carcinog... |
| 6 | 2860980 | Induction of gamma-glutamyl transpeptidase in ... |
| 3 | 2781555 | Induction of hepatic metallothionein in male B... |
| 4 | 3106031 | Microsomal ethanol-oxidizing system. |
| 1 | 1344831 | [Omeprazole and liver functions]. |


mixing freetext and concepts

In [123]:
query = mixedQuery(['carcinogen', 'human'],['chebi:2386'], False)
df = fetchDocumentsByQueryAsPD( query, corpusID, '25')
df


https://api.scaiview.com/api/v5/corpora/academia_2019_v_1_0_1/search/documents?size=25
got:  1  documents


| | PMID | Title |
|---|---|---|
| 1 | 1910589 | Critical considerations in the immunochemical ... |


In [122]:
fetchDocumentAndRenderHTML('PMID:'+df.loc[1, 'PMID'], corpusID)


## Step 4 - mine full text documents

let's start a search on open access articles from PMC


In [11]:
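# NOTE: querySCAIViewAsDocuments is not defined anywhere in this notebook
documents = querySCAIViewAsDocuments('document:paracetamol OR document:carcinogen*', 'PMC_2019', 25)
data = pd.json_normalize(documents)  # pd.io.json.json_normalize is deprecated
data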

document:paracetamol OR document:carcinogen*: 607000 found


| | `_version_` | abstract | authors | date | docType | documentIdentifiers | id | journal | language | source | title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1624645043740475392 | [CRISPR-Cas9 and Cas12a (Cpf1) nucleases are t... | [Keunsub Lee, Yingxiao Zhang, Benjamin P. Klei... | 1583078251468 | [PMC_FULLTEXT] | [PMID:29972722, DOI:10.1111/pbi.12982, MANUSCR... | PMCID:PMC6320322 | Plant Biotechnol. J. | [EN] | Plant Biotechnol. J. | [Activities and specificities of CRISPR-Cas9 a... |
| 1 | 1624644414228922368 | [Perivascular accumulation of lymphocytes can ... | [Alexander M. S. Barron, Julio C. Mantero, Jon... | 1580572051111 | [PMC_FULLTEXT] | [PMID:30510068, DOI:10.4049/jimmunol.1801209, ... | PMCID:PMC6305793 | J. Immunol. | [EN] | J. Immunol. | [Perivascular Adventitial Fibroblast Specializ... |
| 2 | 1624644053385609216 | [Human embryonic stem cell-derived cardiomyocy... | [Ana De La Mata, Sendoa Tajada, Samantha O’Dwy... | 1580571707008 | [PMC_FULLTEXT] | [PMID:30353632, DOI:10.1002/stem.2927, MANUSCR... | PMCID:PMC6312737 | Stem Cells | [EN] | Stem Cells | [BIN1 induces the formation of T-tubules and a... |
| 3 | 1624637779106332672 | [Many bacterial species use the MecA/ClpCP pro... | [M. Son, J. Kaspar, S.J. Ahn, R.A. Burne, S.J.... | 1577887323393 | [PMC_FULLTEXT] | [PMID:29873131, DOI:10.1111/mmi.13992, MANUSCR... | PMCID:PMC6281771 | Mol. Microbiol. | [EN] | Mol. Microbiol. | [Threshold regulation and stochasticity from t... |
| 4 | 1624637616277159936 | [Allogeneic stem cell transplantation (allo-HC... | [Tomomi Toubai, Hiroya Tamaki, Daniel C. Pelti... | 1577887168102 | [PMC_FULLTEXT] | [PMID:30389773, DOI:10.4049/jimmunol.1800148, ... | PMCID:PMC6240608 | J. Immunol. | [EN] | J. Immunol. | [Mitochondrial deacetylase SIRT3 plays an impo... |
| 5 | 1624638250831314944 | [Inhibition of vascular endothelial growth fac... | [Aaron B. Simmons, Colin A. Bretz, Haibo Wang,... | 1575209373251 | [PMC_FULLTEXT] | [PMID:29730824, DOI:10.1007/s10456-018-9618-5,... | PMCID:PMC6203654 | Angiogenesis | [EN] | Angiogenesis | [Gene therapy knockdown of VEGFR2 in retinal e... |
| 6 | 1624636802946039808 | [The combined inhibition of histone deacetylas... | [Alexander Badamchi-Zadeh, Kelly D. Moynihan, ... | 1575207992455 | [PMC_FULLTEXT] | [PMID:30249811, DOI:10.4049/jimmunol.1800885, ... | PMCID:PMC6196294 | J. Immunol. | [EN] | J. Immunol. | [Combined HDAC and BET inhibition enhances mel... |
| 7 | 1624635234262712320 | [Several heat shock proteins (HSP) prime immun... | [Yifei Wang, Abigail L. Sedlacek, Sudesh Pawar... | 1573824096442 | [PMC_FULLTEXT] | [PMID:30209191, DOI:10.4049/jimmunol.1800505, ... | PMCID:PMC6176107 | J. Immunol. | [EN] | J. Immunol. | [The Heat Shock Protein, gp96, activates infla... |
| 8 | 1624639715170844672 | [Ciliary and flagellar motility is caused by t... | [Stephen M. King] | 1571232769765 | [PMC_FULLTEXT] | [PMID:30176122, DOI:10.1002/cm.21483, MANUSCRI... | PMCID:PMC6249098 | Cytoskeleton (Hoboken) | [EN] | Cytoskeleton (Hoboken) | [Turning Dyneins Off Bends Cilia] |
| 9 | 1624643430005080064 | [Proteinase 3 (P3), a serine protease expresse... | [Tian-Hui Yang, Lisa S. St John, Haven R. Garb... | 1569940312504 | [PMC_FULLTEXT] | [PMID:30021768, DOI:10.4049/jimmunol.1800324, ... | PMCID:PMC6099529 | J. Immunol. | [EN] | J. Immunol. | [Membrane-associated proteinase 3 on granulocy... |



Here is an interesting document, which we would like to analyze further:


In [12]:
fetchDocumentAndRenderHTML('PMCID:PMC2018698', 'PMC_2019')


Now process this document via text mining:
* identify sections and sentences
* tag chemical compounds and hypotheses

In [13]:
status = textMineDocument('PMCID:PMC2018698', 'PMC_2019', 'test', 'chemical-pipeline')
status.text

'Successfully retrieved and queued all documents'

load the precomputed text mining result:

### filter documents for relevant sentences and display them ...

The goal is to classify compounds into the groups used by the [Agents Classified by the IARC Monographs](https://monographs.iarc.fr/agents-classified-by-the-iarc/):
* Group 1 - Carcinogenic to humans
* Group 2A - Probably carcinogenic to humans
* Group 2B - Possibly carcinogenic to humans
* Group 3 - Not classifiable as to its carcinogenicity to humans
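
For labelling results later on, the four groups listed above can be kept in a small lookup table. This is only a sketch: the descriptions are copied from the list above, while `IARC_GROUPS` and `describe_group` are hypothetical names.

```python
# Group descriptions copied from the list above
IARC_GROUPS = {
    '1': 'Carcinogenic to humans',
    '2A': 'Probably carcinogenic to humans',
    '2B': 'Possibly carcinogenic to humans',
    '3': 'Not classifiable as to its carcinogenicity to humans',
}

def describe_group(code):
    """Return 'Group <code> - <description>' for inputs like '2b' or 'Group 2B'."""
    key = str(code).upper().replace('GROUP', '').strip()
    if key not in IARC_GROUPS:
        raise KeyError('unknown IARC group: ' + repr(code))
    return 'Group ' + key + ' - ' + IARC_GROUPS[key]

label = describe_group('group 2b')
print(label)
```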

**Tool used: JProMiner with the following terminologies**
* DrugBank (drugs)
* Homo_sapiens (genes and proteins)
* ATC (drug classes)
* BAO (assays) 
* HypothesisFinder (speculative statements)
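
The filter in the next cell is a co-occurrence test: a sentence is kept only if it carries both a hypothesis annotation and a compound annotation. A simplified, self-contained sketch of that idea on a toy document follows. The nesting and the `annotationType` values mirror the field names used in the next cell. `sentences_with_all`, `_sentence` and the sentences themselves are made up, and unlike the real cell this sketch checks annotation types only, not the annotated identifier.

```python
def sentences_with_all(document, required_types):
    """Return unique sentence texts that carry every annotation type in required_types."""
    required = set(required_types)
    found = []
    for section in document['documentElement']['bodyMatter']['sections']:
        for paragraph in section['paragraphs']:
            for element in paragraph['structureElements']:
                text = element['sentence']['text']
                types = {a['annotationType'] for a in text['annotations']}
                # keep the sentence only if all required types co-occur in it
                if required <= types and text['text'] not in found:
                    found.append(text['text'])
    return found

def _sentence(text, *types):
    """Build one toy structure element with the given annotation types."""
    annotations = [{'annotationType': t} for t in types]
    return {'sentence': {'text': {'text': text, 'annotations': annotations}}}

# Toy document; the sentences are invented for illustration.
toy_document = {'documentElement': {'bodyMatter': {'sections': [
    {'paragraphs': [{'structureElements': [
        _sentence('Compound X may cause harm.', 'hypothesis', 'CHEBI'),
        _sentence('Compound X was given orally.', 'CHEBI'),
        _sentence('This might matter.', 'hypothesis'),
    ]}]},
]}}}

hits = sentences_with_all(toy_document, ['hypothesis', 'CHEBI'])
print(hits)
```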



In [14]:
# load the precomputed text mining result
with open('PMCID:PMC2018698.json') as f:
    document = json.load(f)

sentences = []
annotations = []

for section in document['documentElement']['bodyMatter']['sections']:
    for paragraph in section['paragraphs']:
        for strucElem in paragraph['structureElements']:
            sentence = strucElem['sentence']['text']['text']
            sentAnnotations = strucElem['sentence']['text']['annotations']
            # flags: does the sentence carry a hypothesis and a mention of the compound?
            hypo = False
            hasAcetaminophen = False  # separate name so the concept id from Step 2 is not overwritten
            for annotation in sentAnnotations:
                if annotation['annotationType'] == "hypothesis":
                    hypo = True
                if annotation['annotationType'] == "CHEBI" and annotation['annotationText'] == '2386':
                    hasAcetaminophen = True
            # keep each matching sentence once, together with its annotations
            if sentence not in sentences and hypo and hasAcetaminophen:
                sentences.append(sentence)
                annotations.append(sentAnnotations)

sdf = pd.DataFrame(data = {'sentence': sentences, 'annotation': annotations})
sentences




['While some studies of bladder cancer found evidence of an elevated risk associated with heavy use of paracetamol, the majority did not, and some suggested an overall decreased risk [[<6>,<8>,<10>,<11>,<13>-<15>,<19>], additional file<1>].',
 'Our data further support an etiologic role of phenacetin in bladder cancer occurrence and they further suggest that risk increases with duration of use.Paracetamol is a metabolite of phenacetin, but it is unclear whether paracetamol retains the carcinogenic potential of its parent compound.',
 'Paracetamol is not a potent inhibitor of cyclooxygenase (COX), but may inhibit NFkB, a transcription factor related to the inhibition of apoptosis [<29>], up-regulated in several cancers, including bladder cancer [<30>].',
 'Metabolism of paracetamol results in a reactive metabolite (N-acetyl-P-benzoquinone imine (NAPQI)) that can form DNA adducts [<31>] and cause liver and renal toxicity [<32>].',
 'Thus, paracetamol, in theory, could promote apoptosis t