# Scraping the medical literature to add causal relationships among UMLS concepts in the graph database

## The Problem

![](images/Causal_expansion1.png)

The picture above illustrates one powerful use of causal relationships among diseases and their downstream effects. We can see a chain of causation that flows from the use of methamphetamines to death.

![](images/Causal_expansion2.png)

If we look at the longest path between methamphetamine use and death, we see the most detailed cause-effect chain known. Each link in that chain is an opportunity to intervene, and for a typical patient in heart failure due to methamphetamine use, we do intervene at multiple points, as shown here.

![](images/Concept_nodes_example.png)

The Unified Medical Language System (UMLS) has collected about 4.3 million medical concepts, which have been imported as nodes in our working group's graph database. Some data sources specify relationships among these nodes, but so far we have not found one that records direct <strong>causal</strong> relationships among them.

## Mission
<strong>Scrape the world's medical literature to find causal relationships among UMLS concepts and add the relationships to the graph.</strong>

Example:

Starting with an input string like this:  
>'The autopsy showed no evidence of osteosarcoma, and the likely cause of death was cardiac failure with the evidence of pulmonary congestion, liver congestion, and multiple body cavity effusions.'  

Do some of this magic:  
![](images/sentence_parsing.svg)  
Source: https://allenai.github.io/scispacy/

And output a table that looks something like this:  

|Concept_1|Relationship|Concept_2|Source PMID|  
|---|---|---|---|  
|cardiac failure|CAUSES|death|33554025|  
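
As a rough illustration of the mapping from input string to table row, the sketch below uses one hand-written regular expression instead of the dependency parse pictured above. The `CausalTriple` type, the `extract_cause_of` helper, and the pattern are hypothetical names introduced here for illustration; a real pipeline would still need to map the matched phrases to UMLS concepts.

```python
import re
from typing import NamedTuple, Optional


class CausalTriple(NamedTuple):
    """One row of the output table (hypothetical structure)."""
    concept_1: str
    relationship: str
    concept_2: str
    source_pmid: str


# Matches "cause of <effect> was <cause>" and stops the cause at the first
# " with ", comma, or period. Deliberately naive: it covers one phrasing only.
CAUSE_OF_PATTERN = re.compile(
    r"cause of (?P<effect>[\w\s-]+?) was (?P<cause>[\w\s-]+?)(?= with |[,.])"
)


def extract_cause_of(sentence: str, pmid: str) -> Optional[CausalTriple]:
    """Return a CAUSES triple if the sentence matches the pattern, else None."""
    match = CAUSE_OF_PATTERN.search(sentence)
    if match is None:
        return None
    return CausalTriple(
        match.group("cause").strip(),
        "CAUSES",
        match.group("effect").strip(),
        pmid,
    )


sentence = (
    "The autopsy showed no evidence of osteosarcoma, and the likely cause of "
    "death was cardiac failure with the evidence of pulmonary congestion, "
    "liver congestion, and multiple body cavity effusions."
)
print(extract_cause_of(sentence, "33554025"))
```

A single pattern like this misses most phrasings, which is why the parse-based approach from scispacy is the more promising route.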

## Helpful tools

### Access NCBI API to get causal strings

The following endpoints are provided by the National Center for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)).

We'll use the [ESearch](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch) utility to get a list of publication ID numbers for articles containing a causal relationship of interest.

We'll then use the [EFetch](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch) utility to fetch the articles with those IDs.
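
The two-step flow can be sketched without touching the network, as below. The helper names (`build_esearch_url`, `parse_esearch_ids`, `build_efetch_url`) are hypothetical, and `SAMPLE_ESEARCH_XML` is a hand-written stand-in with made-up IDs in the shape of an ESearch XML reply, not real output.

```python
import urllib.parse
import xml.etree.ElementTree as ET

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=200):
    """Build an ESearch URL; urlencode takes care of escaping the query."""
    params = urllib.parse.urlencode({"db": db, "term": term, "retmax": retmax})
    return "{}/esearch.fcgi?{}".format(EUTILS_BASE, params)


def parse_esearch_ids(xml_text):
    """Pull the ID strings out of an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


def build_efetch_url(db, ids, rettype="abstract", retmode="xml"):
    """Build one EFetch URL for several IDs (comma-separated)."""
    params = urllib.parse.urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    )
    return "{}/efetch.fcgi?{}".format(EUTILS_BASE, params)


# Hand-written sample with placeholder IDs (an assumption, not real output).
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>"""

ids = parse_esearch_ids(SAMPLE_ESEARCH_XML)
print(build_esearch_url("pubmed", "respiratory failure[Title/Abstract]"))
print(build_efetch_url("pubmed", ids))
```

Passing several IDs in one EFetch call reduces the number of requests compared with fetching one article at a time.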



In [1]:
import requests
from bs4 import BeautifulSoup
import json
import re
import urllib.parse
import pandas as pd
import time

In [3]:
# Get the list of databases made available by NCBI

eutility_db_names = []
db_names = []
uid_common_names = []
databases_url = 'https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly'
databases_url_response = requests.get(databases_url)
databases_url_content = databases_url_response.content

databases_soup = BeautifulSoup(databases_url_content, 'html.parser')
# Raw string so that the backslash-escaped dots reach the CSS selector unchanged
databases_table = databases_soup.select_one(r'#__chapter2\.T\._entrez_unique_identifiers_ui_lrgtbl__ > table:nth-child(1) > tbody:nth-child(2)')
number_of_databases = len(databases_table.find_all('tr'))

for row in range(1, number_of_databases+1):
    db_name = databases_table.select_one('tr:nth-child({}) > td:nth-child(1)'.format(row))
    db_names.append(db_name.text)
    uid_common_name = databases_table.select_one('tr:nth-child({}) > td:nth-child(2)'.format(row))
    uid_common_names.append(uid_common_name.text)
    eutility_db_cell = databases_table.select_one('tr:nth-child({}) > td:nth-child(3)'.format(row))
    eutility_db_names.append(eutility_db_cell.text)

print('database names: {}'.format(db_names))
print('e-utility database names: {}'.format(eutility_db_names))

database names: ['BioProject', 'BioSample', 'Biosystems', 'Books', 'Conserved Domains', 'dbGaP', 'dbVar', 'Epigenomics', 'EST', 'Gene', 'Genome', 'GEO Datasets', 'GEO Profiles', 'GSS', 'HomoloGene', 'MeSH', 'NCBI C++ Toolkit', 'NCBI Web Site', 'NLM Catalog', 'Nucleotide', 'OMIA', 'PopSet', 'Probe', 'Protein', 'Protein Clusters', 'PubChem BioAssay', 'PubChem Compound', 'PubChem Substance', 'PubMed', 'PubMed Central', 'SNP', 'SRA', 'Structure', 'Taxonomy', 'UniGene', 'UniSTS']
e-utility database names: ['bioproject', 'biosample', 'biosystems', 'books', 'cdd', 'gap', 'dbvar', 'epigenomics', 'nucest', 'gene', 'genome', 'gds', 'geoprofiles', 'nucgss', 'homologene', 'mesh', 'toolkit', 'ncbisearch', 'nlmcatalog', 'nuccore', 'omia', 'popset', 'probe', 'protein', 'proteinclusters', 'pcassay', 'pccompound', 'pcsubstance', 'pubmed', 'pmc', 'snp', 'sra', 'structure', 'taxonomy', 'unigene', 'unists']
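
Scraping the documentation table ties the code to that page's HTML layout. As an alternative sketch, the EInfo utility returns the E-utility database names as XML when called with no parameters. The `parse_einfo_db_names` helper is a hypothetical name, and `SAMPLE_EINFO_XML` is a shortened, hand-written stand-in for the reply rather than real output.

```python
import xml.etree.ElementTree as ET

EINFO_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"


def parse_einfo_db_names(xml_text):
    """Return the database names listed in an EInfo XML reply."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./DbList/DbName")]


# Shortened, hand-written sample in the shape of an EInfo reply (an assumption).
SAMPLE_EINFO_XML = """<eInfoResult>
  <DbList>
    <DbName>pubmed</DbName>
    <DbName>pmc</DbName>
    <DbName>mesh</DbName>
  </DbList>
</eInfoResult>"""

print(parse_einfo_db_names(SAMPLE_EINFO_XML))
```

In the notebook this would be called as `parse_einfo_db_names(requests.get(EINFO_URL).text)`. It yields only the E-utility names, so the display names would still come from the scraped table.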


In [4]:
# To get a search query in a format that can be passed into a URL,
# perform an advanced search on PubMed, then copy what follows &term= in that search's URL
query = '(((((((cause[Title/Abstract]) NOT (all-cause[Title/Abstract])) ) ) ) OR (resulting in[Title/Abstract])) OR (due to[Title/Abstract])) AND (respiratory failure[Title/Abstract])'
query = urllib.parse.quote(query, safe='') # Encode the query in URL format

# databases to query. defaults to all ncbi databases
databases_to_query = eutility_db_names

# Get a list of PMIDs
def get_list_ofIDs(query):
    ids_dict = {}
    # Map each e-utility name to its display name, so the lookup still works
    # when databases_to_query is narrowed to a subset of the databases
    display_names = dict(zip(eutility_db_names, db_names))
    for db_name in databases_to_query:
        print('\n{} database:'.format(display_names.get(db_name, db_name)))
        esearch_query_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db={}&retmax=200&term={}'.format(db_name, query)
        
        print('search url: {}'.format(esearch_query_url))
        response = requests.get(esearch_query_url)
        content = response.content
        soup = BeautifulSoup(content, 'html.parser')
        # html.parser lowercases tag names, so <IdList>/<Id> become idlist/id
        ids = [id_tag.get_text(strip=True) for id_tag in soup.select('idlist > id')]
        print('IDs: {}'.format(','.join(ids)))
        if not ids: continue
        ids_dict[db_name] = ids
    return ids_dict

ids_dict = get_list_ofIDs(query)
print(ids_dict)


BioProject database:
search url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&retmax=200&term=%28%28%28%28%28%28%28cause%5BTitle%2FAbstract%5D%29%20NOT%20%28all-cause%5BTitle%2FAbstract%5D%29%29%20%29%20%29%20%29%20OR%20%28resulting%20in%5BTitle%2FAbstract%5D%29%29%20OR%20%28due%20to%5BTitle%2FAbstract%5D%29%29%20AND%20%28respiratory%20failure%5BTitle%2FAbstract%5D%29
IDs: 

BioSample database:
search url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=biosample&retmax=200&term=%28%28%28%28%28%28%28cause%5BTitle%2FAbstract%5D%29%20NOT%20%28all-cause%5BTitle%2FAbstract%5D%29%29%20%29%20%29%20%29%20OR%20%28resulting%20in%5BTitle%2FAbstract%5D%29%29%20OR%20%28due%20to%5BTitle%2FAbstract%5D%29%29%20AND%20%28respiratory%20failure%5BTitle%2FAbstract%5D%29
IDs: 

Biosystems database:
search url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=biosystems&retmax=200&term=%28%28%28%28%28%28%28cause%5BTitle%2FAbstract%5D%29%20NOT%20%28all-
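
The loop above sends one request per database with no pause between calls, and `time` is imported but never used. NCBI's usage guidelines ask for no more than three requests per second when no API key is supplied, so a small throttle is worth considering. The `Throttle` class below is a hypothetical helper, not part of the notebook; the clock and sleep functions are injectable only to make it easy to test.

```python
import time


class Throttle:
    """Blocks so that successive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous wait(), if any

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# At most three requests per second means at least 1/3 s between calls.
throttle = Throttle(min_interval=1 / 3)
```

Calling `throttle.wait()` immediately before each `requests.get(...)` would keep both the ESearch and EFetch loops under that rate.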

In [5]:
# Valid retmode and rettype values taken from
#     https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly

db_list = ids_dict.keys()

def get_efetch_params(db_name):
    return_mode = 'xml'
    return_type = 'abstract'
    text_node = 'abstract'
    
    if db_name == 'pubmed':
        return_mode = 'xml'
        return_type = 'abstract'
        text_node = 'abstract'
    elif db_name in ['bioproject','biosystems']:
        return_type = 'xml'
        return_mode = 'xml'
    elif db_name in ['biosample', 'sra']:
        return_type = 'full'
        return_mode = 'xml'
    elif db_name == 'gds':
        return_type = 'summary'
        return_mode = 'text'
    elif db_name == 'mesh':
        return_type = 'full'
        return_mode = 'text'
    elif db_name in ['nlmcatalog','sequences']:
        return_type = None
        return_mode = 'text'
    elif db_name in ['nuccore','nucest','nucgss','protein','popset']:
        return_type = 'native'
        return_mode = 'xml'
    elif db_name == 'pmc':
        return_type = 'medline'
        return_mode = 'text'
        text_node = 'abstract'
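    elif db_name == 'taxonomy':
        return_type = None
        return_mode = 'xml'
    elif db_name == 'gene':
        # Kept as its own branch: when 'gene' shared the taxonomy branch,
        # this text_node assignment was never reached
        return_type = None
        return_mode = 'xml'
        text_node = 'Entrezgene_summary'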
    elif db_name == 'snp':
        return_type = 'docset'
        return_mode = 'text'
    elif db_name == 'clinvar':
        return_type = 'clinvarset'
        return_mode = 'xml'
    elif db_name == 'gtr':
        return_type = 'gtracc'
        return_mode = 'xml'
    else:
        return_mode = None
        
    return (return_mode, return_type, text_node)
    
def get_article_content(db_name):
    article_contents = {}
    
    return_mode, return_type, text_node = get_efetch_params(db_name)
    
    if return_mode is None: 
        print('{} database not supported by efetch utility'.format(db_name))
        return None
    
    article_ids = ids_dict[db_name]
    for position, article_id in enumerate(article_ids):
        print("article {} of {}".format(position, len(article_ids)))
        # Send rettype only when one is defined; formatting None would put
        # the literal string 'None' into the URL
        efetch_params = 'retmode={}'.format(return_mode)
        if return_type is not None:
            efetch_params = efetch_params + '&rettype={}'.format(return_type)
        efetch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db={}&id={}&{}'.format(db_name, article_id, efetch_params)
        efetch_response = requests.get(efetch_url)
        efetch_content = efetch_response.content
        if return_mode == 'text': 
            # Raw bytes; the caller converts them with str()
            article_text = efetch_content
        else:
            soup = BeautifulSoup(efetch_content, 'html.parser')
            # html.parser lowercases tag names, so search for the lowercased node name
            node = soup.find(text_node.lower())
            if node is None:
                print("no node {} at {}".format(text_node, efetch_url))
                continue  # skip instead of reusing the previous article's text
            article_text = node.get_text()
        article_contents[article_id] = article_text
    
    return article_contents

In [6]:
def extract_strings_of_interest(article_text):
    causal_phrases = ['due to', 'cause', 'resulting in']
    sentences_of_interest = []
    for causal_phrase in causal_phrases:
        regex = r"([^.\n]*?[^-]{}[^.]*\.[^0-9])".format(causal_phrase)
        sentence_list = re.findall(regex, article_text)
        sentences_of_interest = sentences_of_interest + sentence_list
    return sentences_of_interest

for db_name in db_list:
    print("{} database:".format(db_name))
    article_contents = get_article_content(db_name)
    if article_contents is None: continue
    article_ids = article_contents.keys()
    for article_id in article_ids:
        strings_of_interest = extract_strings_of_interest(str(article_contents[article_id]))
        print("strings of interest: article {} - {}\n".format(article_id, strings_of_interest))

# Deal with negatives (e.g. "this does not cause that")

gap database:
gap database not supported by efetch utility
mesh database:
article 0 of 1
strings of interest: article 67536880 - ["b'\\n1: Spinal muscular atrophy with respiratory distress 1 [Supplementary Concept]\\nA hereditary autosomal recessive form of infantile spinal muscular atrophy\\ncharacterized by severe respiratory distress resulting from diaphragmatic\\nparalysis resulting in respiratory failure between 6 weeks and 6 months,\\nDIAPHRAGMATIC EVENTRATION shown on chest x-ray or PREMATURE BIRTH, and\\npredominant involvement of the upper limbs and distal muscles. "]

pubmed database:
article 0 of 200
article 1 of 200
article 2 of 200
article 3 of 200
article 4 of 200
article 5 of 200
article 6 of 200
article 7 of 200
article 8 of 200
article 9 of 200
article 10 of 200
article 11 of 200
article 12 of 200
article 13 of 200
article 14 of 200
article 15 of 200
article 16 of 200
article 17 of 200
article 18 of 200
article 19 of 200
article 20 of 200
article 21 of 200
article 22 o
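
The extraction cell above leaves negated statements (e.g. "this does not cause that") unhandled. One hedged starting point is to drop any sentence in which a negation cue appears within a few words before a causal phrase. The cue list, the three-word window, and the helper names `is_negated` and `drop_negated` are all assumptions made for this sketch; a dedicated negation detector would be more reliable.

```python
import re

CAUSAL_PHRASES = ["due to", "cause", "resulting in"]
# Hypothetical cue list; extend as real examples turn up.
NEGATION_CUES = ["not", "no", "never", "without", "unlikely to"]

# A negation cue, then up to three intervening words, then a causal phrase.
NEGATED_CAUSAL = re.compile(
    r"\b(?:{})\b(?:\s+\w+){{0,3}}?\s+(?:{})".format(
        "|".join(map(re.escape, NEGATION_CUES)),
        "|".join(map(re.escape, CAUSAL_PHRASES)),
    ),
    re.IGNORECASE,
)


def is_negated(sentence):
    """True if a negation cue sits shortly before a causal phrase."""
    return NEGATED_CAUSAL.search(sentence) is not None


def drop_negated(sentences):
    """Keep only the sentences whose causal phrase is not negated."""
    return [s for s in sentences if not is_negated(s)]


examples = [
    "Smoking does not cause this rash.",
    "This was not thought to be due to infection.",
    "Respiratory failure due to pneumonia.",
]
print(drop_negated(examples))
```

Because the window stops at punctuation, a sentence such as the autopsy example, where "no evidence of osteosarcoma" is separated from "cause of death" by a comma, is kept.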

### NLP toolkits

Special thanks to Kevin Obuya for compiling this list:  
https://docs.google.com/spreadsheets/d/13JADjvvbytmJCZ4l9IxmG8MblFYZYWME9vCldM_EKlA/edit?usp=sharing