# Scraping the medical literature to add causal relationships among UMLS concepts in the graph database

Given a graph database containing all UMLS concepts as nodes, use medical texts to find all causal relations among all UMLS concepts and add these relations as formal relationships in the graph. 

Approach 1: Start search with query with UMLS strings linked to the concept "Etiology aspects" (UMLS CUI: C0015127)
Search a medical text and return every sentence containing a UMLS string that points to the UMLS concept "Etiology aspects." Identify the subject and the object of the causal verb. Search the list of UMLS strings for the subject and object. When both subject and object of a sentence exist in the UMLS strings, write cypher code that MATCHes those UMLS strings and their related :Concepts and MERGEs a new :CAUSES relationship between the UMLS concepts. Save the Pubmed ID of the source of the causal information and the sentence describing the causal relationship as properties of the CAUSES relationship.


- For each concept in the UMLS, use Pubmed's ESearch utility to search for the MeSH term and any of the narrower concepts of the [Linkage Concept (CUI C0332280)](https://uts.nlm.nih.gov/uts/umls/concept/C0332280) or any of the concepts under [functionally_related_to](https://www.nlm.nih.gov/research/umls/META3_current_relations.html) in the semantic network. This returns a list of PMIDs.
- Pass the list of PMIDs into the EFetch utility to get a list of text abstracts
- Identify the object of any causal verb and search a list of distinct UMLS strings for the object
- For any matches, write and execute cypher code to match the concept nodes connected to the subject and object, and MERGE a :CAUSES relationhip with properties of source: Pubmed, pmid, and the sentence in which the causal relationship was stated. 
- Have human experts review the graph to find inappropriate connections and revise the code to avoid making such connections.

The NCBI's E-utilities can be used to automate Pubmed searches.

Minimizing the Number of Requests  [Source](https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.The_Nine_Eutilities_in_Brief)  
If a task requires searching for and/or downloading a large number of records, it is much more efficient to use the Entrez History to upload and/or retrieve these records in batches rather than using separate requests for each record. Please refer to Application 3 in Chapter 3 for an example. Many thousands of IDs can be uploaded using a single EPost request, and several hundred records can be downloaded using one EFetch request.

[Details on EFetch](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch)

PMIDs_str = []
url = '''https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={PMIDs}&retmode=text&rettype=abstract'''.format(PMIDs=PMIDs_str)

In [2]:
import requests
from bs4 import BeautifulSoup
import json
import re
import urllib.parse
import pandas as pd
import time

In [3]:
# To get the format for search query that can be passed into a URL, 
# perform an advanced search at pubmed, then copy what follows the &term= from that search's URL

# Get a list of PMIDs
# Pass the query into the ESearch utility to get a list of PMIDs

query = '(((((((cause[Title/Abstract]) NOT (all-cause[Title/Abstract])) ) ) ) OR (resulting in[Title/Abstract])) OR (due to[Title/Abstract])) AND (respiratory failure[Title/Abstract])'
query = urllib.parse.quote(query, safe='') # Encode the query in URL format
ESearch_base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=200&term='
url = ESearch_base + query

print(url)
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, 'html.parser')
PMIDs_str = soup.idlist.get_text()
PMIDs_str = PMIDs_str.replace('\n',',')
PMIDs_str = PMIDs_str[1:-1]

# Check the PMIDs_str for proper format
PMIDs_str

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=200&term=%28%28%28%28%28%28%28cause%5BTitle%2FAbstract%5D%29%20NOT%20%28all-cause%5BTitle%2FAbstract%5D%29%29%20%29%20%29%20%29%20OR%20%28resulting%20in%5BTitle%2FAbstract%5D%29%29%20OR%20%28due%20to%5BTitle%2FAbstract%5D%29%29%20AND%20%28respiratory%20failure%5BTitle%2FAbstract%5D%29


'33794205,33793086,33791177,33791101,33790521,33790512,33788191,33788015,33786448,33785355,33783269,33782861,33782774,33781349,33780519,33779386,33778090,33777571,33776717,33776431,33769275,33769103,33768630,33768195,33766961,33766794,33764182,33762493,33760464,33758887,33758161,33758150,33754916,33754088,33752392,33751131,33750741,33750338,33748247,33747786,33747760,33747592,33747413,33744911,33743806,33741569,33739956,33735661,33731006,33729129,33728514,33728168,33728071,33727299,33724365,33722271,33721137,33720607,33717751,33717368,33716310,33711919,33710610,33709528,33709318,33706592,33705348,33704883,33693057,33691378,33688576,33688440,33687672,33687180,33686984,33686492,33685769,33681677,33681257,33681097,33680800,33680448,33679753,33679254,33678770,33678052,33676105,33676091,33672672,33670462,33670260,33666909,33666682,33666071,33666070,33665764,33664959,33664810,33663958,33663129,33658444,33657294,33655986,33655275,33653913,33653908,33651923,33651250,33648989,33646336,33645461,

In [4]:
# Fetch the abstract for each PMID on the list
url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={PMIDs_str}&retmode=xml&rettype=abstract'
# url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=101772813&retmode=text&rettype=abstract'

url = url.format(PMIDs_str=PMIDs_str)
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, 'html.parser')
count = 0
output_dict = {}
PMID_list = PMIDs_str.split(',')

regex = r"([^.\n]*?[^-]cause of[^.]*\.[^0-9])"

for pmid in PMID_list:
    abstract = soup.find_all('abstract')[count].get_text()
    sentence_list = re.findall(regex, abstract)
    if len(sentence_list) > 0:
        output_dict[pmid] = sentence_list
        print(output_dict[pmid], pmid)
    count += 1

# Deal with negatives (e.g. "this does not cause that")

['Influenza virus, rhinovirus, and adenovirus frequently cause viral pneumonia, an important cause of morbidity and mortality especially in the extreme ages of life. ', ' In conclusion, viral pneumonia is a relevant cause of CAP, whose interest is increasing due to the current COVID-19 outbreak. ', 'To set up a therapeutic approach is difficult because of the low number of active molecules and the conflicting data bearing supportive treatments such as steroids.\n'] 33782861
['Acute Respiratory Distress Syndrome (ARDS) is a frequent cause of respiratory failure in intensive care unit (ICU) patients and results in significant morbidity and mortality. '] 33779386
['Impaired immune response has been reported to be the cause of the development of coronavirus disease 2019 (COVID-19)-related respiratory failure. '] 33778090
[' This approach allowed us to diagnose the cause of acutely rising transaminases in a patient in severe ARDS secondary to influenza pneumonia requiring veno-venous extrac

IndexError: list index out of range

In [314]:
text = output_dict['33554025'][0]
print(text)

 The autopsy showed no evidence of osteosarcoma, and the likely cause of death was cardiac failure with the evidence of pulmonary congestion, liver congestion, and multiple body cavity effusions.



Use UMLS_2020AB.ipynb to create a CSV with all unique strings in the UMLS and their respective CUIs. Move that CSV into the folder where this Jupyter notebook is saved.

In [293]:
str_to_CUI = pd.read_csv('str_to_CUI.csv', encoding='utf-8')
str_to_CUI.dropna(inplace=True)
str_to_CUI = str_to_CUI[~str_to_CUI['STR'].str.contains('cause')]

In [294]:
str_to_CUI.tail()

Unnamed: 0,STR,CUI
13058669,ﾜﾝﾌｶｲｶﾝ,C0877610
13058670,ﾜﾝﾍﾝｹｲ,C0919717
13058671,ﾜﾝﾍﾝｹｲNOS,C0919717
13058672,ﾜﾝﾎｳｿｳｴﾝ,C0562422
13058673,ﾜﾝﾚｯｼｮｳ,C0432974


In [307]:
# Define a function that conducts a fast binary search on a sorted column of a dataframe, returning only full match results.

def binary_search(dataframe, column, target):
    range_start = 0
    range_end = len(dataframe)-1
    while range_start < range_end:
        range_middle = (range_end + range_start) // 2
        value = dataframe.iloc[range_middle][column]
        if value == target:
            return dataframe.iloc[range_middle]
        elif value < target:
            # Discard the first half of the range
            range_start = range_middle + 1
        else:
            # Discard the second half of the range
            range_end = range_middle - 1
    # At this point range_start = range_end
    value = dataframe.iloc[range_start][column]
#     return value
    if value == target:
        return dataframe.iloc[range_start]
    else:
        return 0

# Test the function
start_time = time.time()
frame = binary_search(dataframe = str_to_CUI, column = 'STR', target = '心筋梗塞')
if type(frame) == int:
    print("No match")
else:
    print(frame['CUI'])
print("Runtime:", time.time() - start_time, "seconds")

C0027051
Runtime: 0.0035467147827148438 seconds


In [331]:
# Define a function which takes a string and returns a list of CUIs or the strings associated with CUIs

def text_to_CUIs(text):
    
    # Remove any non-alphanumeric characters, set the encodning to unicode, and split the text into a list of words
    text = re.sub('[\W_]+', ' ', text, flags=re.UNICODE)
    text = text.split(' ')

    # Iterate through the list of words to find the largest sets of consecutive words that match CUI-associated strings, and append these to a term list
    used = set([])
    term_list = []
    for i in reversed(range(1,6)):
        index = 0
        while index < len(text):
            if not index in used:
                term = ' '.join(text[index:(index+i)])
                frame = binary_search(dataframe = str_to_CUI, column = 'STR', target = term)
                if type(frame) == int:
                    index += 1
                else:
                    used.update(range(index, index+i))
                    term_list.append([index, term, frame['CUI']])
                    index += i
            else:
                index += 1
    
    # Append any non-matched words to the term list
    index = 0
    for word in text:
        if not index in used:
            term_list.append([index, word])
        index += 1

    # Sort the term list according to the order of the terms in the original text
    term_list = sorted(term_list, key=lambda x: x[0])
    
    return term_list

# Run a test query
test_text = ' On the other hand, the main cause of death in patients with tracheostomy invasive ventilation was respiratory infection, which was noted in 26 of 82, while other causes varied. '
start_time = time.time()

term_list = text_to_CUIs(test_text)

print("Runtime:", (time.time() - start_time), "seconds")

print(test_text)
# print([x[2] if len(x) == 3 else x[1] for x in term_list])
outlist = []
for word in term_list:
    if len(word) == 3:
        outlist.append('('+word[1]+' '+word[2]+')')
    else:
        outlist.append(word[1])
print(' '.join(outlist))

Runtime: 0.2151045799255371 seconds
 On the other hand, the main cause of death in patients with tracheostomy invasive ventilation was respiratory infection, which was noted in 26 of 82, while other causes varied. 
 (On C1720176) the (other C0237094) (hand C0018563) the (main C0205225) cause of (death C0011065) (in C0021223) (patients C0030705) with (tracheostomy C0040590) (invasive C0205281) (ventilation C0035203) was (respiratory infection C0035243) which was noted (in C0021223) (26 C0227067) of (82 C3641023) (while C0750519) (other C0237094) causes varied 


In [292]:
str_to_CUI[~str_to_CUI['STR'].str.contains('cause')]

Unnamed: 0,STR,CUI
0,""""" w/o Surgery Capability",C1548830
1,Debulking (résection) de tumeur,C0439805
2,Wet prep positif,C0861028
3,!Orthotrichum mandonii,C5257799
4,"!Orthotrichum mandonii Schimp. ex Hampe, 1865",C5257799
...,...,...
13058669,ﾜﾝﾌｶｲｶﾝ,C0877610
13058670,ﾜﾝﾍﾝｹｲ,C0919717
13058671,ﾜﾝﾍﾝｹｲNOS,C0919717
13058672,ﾜﾝﾎｳｿｳｴﾝ,C0562422


In [254]:

# remove adjacent duplicates
# generalize process to define a function that can be used to convert any string into an ordered list of CUIs
# systematically identify subject and object of cause
# write out to graph database

In [113]:
foo = set(range(0, 4))
foo
# set([0, 1, 2, 3])
# >>> foo.update(range(2, 6))
# >>> foo
# set([0, 1, 2, 3, 4, 5])

{0, 1, 2, 3}

### Assigning directionality to the :CAUSES relationship

#### With pattern noun1-verb-noun2 (e.g. this causes that):

(noun1) - [:CAUSES] - > (noun2)
- cause of
- causes
- results in

(mulitple noun1)  - [:CAUSES] - > (noun2)
- cause
- causes of
- result in

(noun1) < - [:CAUSES] - (noun2)
- caused by
- due to
- because of

With pattern verb-noun1-noun2

findall, and include them if they are adjacent to one another and adjacent to the causality verb

### Assigning subject and object
(capture_group_1)regex_pattern(capture_group_2)  
find CUIs in capture groups and decide which CUIs to assign as subject and object:  
- filter on certain semantic types like nouns
- Maybe use last CUI in capture_group_1 and first CUI in capture_group_2? Will likely have to adjust this for each variation of the sentence structure for a causal statement

## Relationship properties
- python-formatted list of lists of PMIDs for every level of evidence
- python-formatted list of lists of the count of PMIDs for every level of evidence

Levels of evidence ([Source](https://guides.library.stonybrook.edu/evidence-based-medicine/levels_of_evidence))  

|Level|Description|  
|---|---|  
|1|Evidence from a systematic review of all relevant randomized controlled trials.|  
|2|Evidence from a meta-analysis of all relevant randomized controlled trials.|  
|3|Evidence from evidence summaries developed from systematic reviews|  
|4|Evidence from guidelines developed from systematic reviews|  
|5|Evidence from meta-syntheses of a group of descriptive or qualitative studies|  
|6|Evidence from evidence summaries of individual studies|  
|7|Evidence from one properly designed randomized controlled trial|  
|8|Evidence from nonrandomized controlled clinical trials, nonrandomized clinical trials, cohort studies, case series, case reports, and individual qualitative studies.|  
|9|Evidence from opinion of authorities and/or reports of expert committee|  
|10|Everything else|  

In [5]:
# Get a list of all publication types listed under the Pubmed advanced filter "Publication Types" by clicking the Show Index button. 
# Source: https://pubmed.ncbi.nlm.nih.gov/advanced/
all_pub_types = ["adaptive clinical trial", "biography", "address", "autobiography", "bibliography", "book illustrations", "case reports", "clinical study", "classical article", "clinical conference", "clinical trial", "clinical trial, phase i", "clinical trial protocol", "clinical trial, phase ii", "clinical trial, phase iii", "clinical trial, phase iv", "clinical trial, veterinary", "collected work", "collected works", "comment", "comparative study", "congress", "consensus development conference", "consensus development conference, nih", "controlled clinical trial", "corrected and republished article", "dataset", "dictionary", "directory", "duplicate publication", "editorial", "electronic supplementary materials", "english abstract", "ephemera", "equivalence trial", "evaluation studies", "evaluation study", "expression of concern", "festschrift", "government publication", "guideline", "historical article", "interactive tutorial", "interview", "introductory journal article", "journal article", "lecture", "legal case", "legislation", "letter", "manuscript", "meta analysis", "multicenter study", "news", "newspaper article", "observational study", "observational study, veterinary", "overall", "patient education handout", "periodical index", "personal narrative", "pictorial work", "popular work", "portrait", "practice guideline", "pragmatic clinical trial", "preprint", "publication components", "publication formats", "published erratum", "randomized controlled trial", "randomized controlled trial, veterinary", "research support, american recovery and reinvestment act", "research support, n i h , extramural", "research support, n i h , intramural", "research support, non u s gov t", "research support, u s gov t, non p h s", "research support, u s gov t, p h s", "research support, u s government", "retracted publication", "retraction of publication", "review", "scientific integrity review", "study characteristics", "support of research", "systematic review", "technical report", "twin study", "validation study", "video audio media", "webcast"]

In [7]:
# Sort publication types into the relevant evidence level 1-9. If a publication type doesn't clearly fit into one of these evidence
# levels, omit it from the evidence_level_pubtypes dictionary
evidence_level_pubtypes = {}
evidence_level_pubtypes[1] = ['systematic review']
evidence_level_pubtypes[2] = ['meta analysis']
evidence_level_pubtypes[3] = []
evidence_level_pubtypes[4] = ['practice guideline']
evidence_level_pubtypes[5] = []
evidence_level_pubtypes[6] = []
evidence_level_pubtypes[7] = ['randomized controlled trial']
evidence_level_pubtypes[8] = ['adaptive clinical trial','case reports','clinical study', 'clinical trial','clinical trial, phase i','clinical trial protocol','clinical trial, phase ii','clinical trial, phase iii','clinical trial, phase iv','comparative study','controlled clinical trial','equivalence trial', 'multicenter study','observational study','pragmatic clinical trial']
evidence_level_pubtypes[9] = ['clinical conference','congress','consensus development conference','consensus development conference, nih','government publication','guideline']
evidence_level_pubtypes["retracted"] = ['retracted publication','retraction of publication']

evidence_level_pubtypes.keys()

dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 'retracted'])

In [None]:
# Define a function that takes a publication type and returns the level of evidence for that publication type


Publication types not used in the evidence_level_pubtypes dictionary:

 'biography',
 'address',
 'autobiography',
 'bibliography',
 'book illustrations',
 
 'classical article',
 
 'clinical trial, veterinary',
 'collected work',
 'collected works',
 'comment',

 'corrected and republished article',
 'dataset',
 'dictionary',
 'directory',
 'duplicate publication',
 'editorial',
 'electronic supplementary materials',
 'english abstract',
 'ephemera',
 
 'evaluation studies',
 'evaluation study',
 'expression of concern',
 'festschrift',
 
 
 'historical article',
 'interactive tutorial',
 'interview',
 'introductory journal article',
 'journal article',
 'lecture',
 'legal case',
 'legislation',
 'letter',
 'manuscript',


 'news',
 'newspaper article',
 
 'observational study, veterinary',
 'overall',
 'patient education handout',
 'periodical index',
 'personal narrative',
 'pictorial work',
 'popular work',
 'portrait',
 
 
 'preprint',
 'publication components',
 'publication formats',
 'published erratum',
 
 'randomized controlled trial, veterinary',
 'research support, american recovery and reinvestment act',
 'research support, n i h , extramural',
 'research support, n i h , intramural',
 'research support, non u s gov t',
 'research support, u s gov t, non p h s',
 'research support, u s gov t, p h s',
 'research support, u s government',

 'review',
 'scientific integrity review',
 'study characteristics',
 'support of research',
 
 'technical report',
 'twin study',
 'validation study',
 'video audio media',
 'webcast']