# Scraping the medical literature to add causal relationships among UMLS concepts in the graph database

## The Problem

![](images/Causal_expansion1.png)

The picture above illustrates one powerful use of causal relationships among diseases and their downstream effects: it shows a chain of causation that flows from the use of methamphetamines to death.

![](images/Causal_expansion2.png)

If we look at the longest path between methamphetamine use and death, we see the most detailed known cause-effect chain. Each link in that chain is an opportunity to intervene, and in a typical patient experiencing heart failure due to methamphetamine use, we do intervene at multiple points, as shown here.

![](images/Concept_nodes_example.png)

The Unified Medical Language System (UMLS) has collected about 4.3 million medical concepts, which have been imported as nodes in our working group's graph database. Some data sources specify relationships among these nodes, but we have not yet found any data source that shows direct <strong>causal</strong> relationships among them.

## Mission
<strong>Scrape the world's medical literature to find causal relationships among UMLS concepts and add the relationships to the graph.</strong>

Example:

Starting with an input string like this:  
>'The autopsy showed no evidence of osteosarcoma, and the likely cause of death was cardiac failure with the evidence of pulmonary congestion, liver congestion, and multiple body cavity effusions.'  

Do some of this magic:  
![](images/sentence_parsing.svg)  
Source: https://allenai.github.io/scispacy/

And output a table that looks something like this:  

|Concept_1|Relationship|Concept_2|Source PMID|  
|---|---|---|---|  
|cardiac failure|CAUSES|death|33554025|  
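
As a rough illustration of the input and output shapes only, the sketch below produces the row above from the example sentence with a single regular expression. It is not the dependency-parsing approach pictured, and it does not map the extracted text to UMLS concepts; the pattern and the function name `extract_cause_of_death` are hypothetical.

```python
import re

# Hypothetical pattern: captures the phrase that follows "cause of death was",
# stopping at " with", at punctuation, or at the end of the string.
CAUSE_OF_DEATH = re.compile(
    r"cause of death was ([^,.;]+?)(?= with\b|[,.;]|$)", re.IGNORECASE
)


def extract_cause_of_death(sentence, pmid):
    """Return one table row as a dict, or None if the pattern does not match."""
    match = CAUSE_OF_DEATH.search(sentence)
    if match is None:
        return None
    return {
        "Concept_1": match.group(1).strip(),
        "Relationship": "CAUSES",
        "Concept_2": "death",
        "Source PMID": pmid,
    }


sentence = (
    "The autopsy showed no evidence of osteosarcoma, and the likely cause of "
    "death was cardiac failure with the evidence of pulmonary congestion, "
    "liver congestion, and multiple body cavity effusions."
)
row = extract_cause_of_death(sentence, "33554025")
print(row)
```

A single pattern like this only covers one phrasing; the point is the target record format, which any extraction method would need to emit.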

## Helpful tools

### Access NCBI API to get causal strings

The following endpoints are provided by the National Center for Biotechnology Information ([NCBI](https://www.ncbi.nlm.nih.gov/)).

We'll use the [ESearch](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch) utility to get a list of PubMed IDs (PMIDs) for articles containing a causal relationship of interest.

We'll then use the [EFetch](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch) utility to fetch the articles for those PMIDs.
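
The two request URLs can be assembled from a parameter dictionary instead of by string concatenation. This is a minimal sketch under that assumption; the helper names `build_esearch_url` and `build_efetch_url` are hypothetical, and the parameters are the same ones used in the notebook cells in this section. It only builds URLs and makes no network request.

```python
from urllib.parse import quote, urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=200):
    """Build an ESearch URL that returns up to `retmax` PMIDs for `term`."""
    params = {"db": "pubmed", "retmax": retmax, "term": term}
    # quote_via=quote encodes spaces as %20 rather than '+'
    return f"{EUTILS_BASE}/esearch.fcgi?" + urlencode(params, quote_via=quote)


def build_efetch_url(pmids):
    """Build an EFetch URL that returns the abstracts for a list of PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "retmode": "xml",
        "rettype": "abstract",
    }
    # safe="," keeps the comma-separated ID list readable in the URL
    return f"{EUTILS_BASE}/efetch.fcgi?" + urlencode(params, safe=",", quote_via=quote)


search_url = build_esearch_url("due to[Title/Abstract]")
fetch_url = build_efetch_url(["34117075", "34116345"])
print(search_url)
print(fetch_url)
```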



In [4]:
import requests
from bs4 import BeautifulSoup
import json
import re
import urllib.parse
import pandas as pd
import time

In [5]:
# To get a search query in a format that can be passed into a URL,
# perform an advanced search at PubMed, then copy what follows &term= in that search's URL
query = '(((((((cause[Title/Abstract]) NOT (all-cause[Title/Abstract])) ) ) ) OR (resulting in[Title/Abstract])) OR (due to[Title/Abstract])) AND (respiratory failure[Title/Abstract])'
query = urllib.parse.quote(query, safe='') # Percent-encode the query for use in a URL

# Pass the query into the ESearch utility to get a list of PMIDs
ESearch_base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=200&term='
url = ESearch_base + query

print(url)
response = requests.get(url)
response.raise_for_status() # Stop early on an HTTP error
soup = BeautifulSoup(response.content, 'html.parser')

# Each <Id> element inside <IdList> holds one PMID (html.parser lowercases tag names)
PMID_list = [tag.get_text(strip=True) for tag in soup.idlist.find_all('id')]
PMIDs_str = ','.join(PMID_list)

# Check the PMIDs_str for proper format
PMIDs_str

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=200&term=%28%28%28%28%28%28%28cause%5BTitle%2FAbstract%5D%29%20NOT%20%28all-cause%5BTitle%2FAbstract%5D%29%29%20%29%20%29%20%29%20OR%20%28resulting%20in%5BTitle%2FAbstract%5D%29%29%20OR%20%28due%20to%5BTitle%2FAbstract%5D%29%29%20AND%20%28respiratory%20failure%5BTitle%2FAbstract%5D%29


'34117075,34116345,34116002,34115044,34113763,34112941,34112275,34106648,34104895,34104863,34101983,34101597,34094777,34094607,34093735,34093730,34093569,34092904,34090304,34087432,34078721,34078682,34078681,34075388,34071924,34071255,34068847,34066226,34064600,34062958,34061274,34058704,34056135,34055113,34050768,34048158,34046484,34046161,34045812,34044459,34044293,34043674,34041194,34035018,34031351,34029936,34028327,34026547,34026390,34026386,34016056,34014058,34014017,34012970,34012519,34011775,34010072,34009036,34007787,34006594,34001839,34001586,33998884,33998306,33995679,33995419,33995053,33994405,33993599,33990007,33988053,33987115,33979116,33978174,33976894,33976619,33976011,33975901,33975843,33975405,33974881,33974311,33969909,33969082,33968219,33967443,33966588,33966260,33965289,33965265,33965151,33964038,33959241,33959020,33957150,33953510,33951650,33950987,33950951,33950887,33949088,33948780,33948421,33941257,33941249,33940030,33937729,33936710,33934148,33932971,33930089,

In [7]:
# Fetch the abstract for each PMID on the list
url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={PMIDs_str}&retmode=xml&rettype=abstract'

url = url.format(PMIDs_str=PMIDs_str)
response = requests.get(url)
response.raise_for_status() # Stop early on an HTTP error
soup = BeautifulSoup(response.content, 'html.parser')
output_dict = {}

# Capture each sentence containing "due to" that is not preceded by a hyphen.
# A period followed by a digit (a decimal point) is not accepted as the sentence end.
regex = r"([^.\n]*?[^-]due to[^.]*\.[^0-9])"

# Walk the records one article at a time so that each abstract stays paired with
# its own PMID. Some records have no abstract, so indexing all abstracts by
# position misaligns the pairs and eventually raises an IndexError.
for article in soup.find_all('pubmedarticle'):
    pmid = article.find('pmid').get_text()
    abstract_tag = article.find('abstract')
    if abstract_tag is None:
        continue
    sentence_list = re.findall(regex, abstract_tag.get_text())
    if sentence_list:
        output_dict[pmid] = sentence_list
        print(output_dict[pmid], pmid)

# TODO: deal with negatives (e.g. "this does not cause that")

[' We hypothesize that pulmonary hypertension-related strain on the right ventricle due to lung disease, may have led to the observed delay in the recovery of RV function, despite the full recovery of LV function.\n'] 34113763
['Patients with obesity are at increased risk of severe COVID-19, requiring mechanical ventilation due to acute respiratory failure. '] 34112941
[' In this article, we present a newborn who required extracorporeal membrane oxygenation (ECMO) support for acute respiratory failure in the early postoperative period due to exposure to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) after aortic arch repair and ventricular septal defect closure. '] 34112275
['To our knowledge, this is one of the first cases to be reported in the literature on the use of awake extracorporeal membrane oxygenation as a "treatment" for barotrauma due to severe acute respiratory distress syndrome in a coronavirus disease 2019 patient, without the need for invasive mechanical v
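
The "due to" regex in the cell above keeps every matching sentence, including negated ones such as "was not due to". One naive way to handle those is to drop sentences in which a negation cue appears shortly before "due to". The sketch below assumes a short, non-exhaustive cue list and a window of up to three intervening words; both are illustrative choices rather than a validated negation detector, and the function names are hypothetical.

```python
import re

# Assumed cue list and window size: a negation word followed by at most three
# other words before "due to" (e.g. "not thought to be due to").
NEGATED_DUE_TO = re.compile(
    r"\b(?:not|no|never|unlikely)\s+(?:\w+\s+){0,3}due to\b", re.IGNORECASE
)


def is_negated(sentence):
    """Return True if "due to" appears to be negated in the sentence."""
    return NEGATED_DUE_TO.search(sentence) is not None


def filter_affirmed(sentences):
    """Keep only the sentences whose "due to" statement is not negated."""
    return [s for s in sentences if not is_negated(s)]


examples = [
    "Patients required mechanical ventilation due to acute respiratory failure.",
    "The hypoxia was not due to pneumonia.",
    "Respiratory failure was not thought to be due to sepsis.",
]
affirmed = filter_affirmed(examples)
print(affirmed)
```

A cue-word filter like this misses negation expressed elsewhere in the sentence, so it is best treated as a first pass before a proper NLP toolkit is applied.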

### NLP toolkits

Special thanks to Kevin Obuya for compiling this list:  
https://docs.google.com/spreadsheets/d/13JADjvvbytmJCZ4l9IxmG8MblFYZYWME9vCldM_EKlA/edit?usp=sharing