# Use Case: Which patients with Afib to anticoagulate?

Potential approaches:
- Integrate clinical practice guidelines into the graph with a way to traverse them based on patient data
- Train a graph neural network to predict which patients with AF will go on to have a stroke without anticoagulation
- Train a CNN to distinguish EKGs of patients who go on to have a cardioembolic stroke from EKGs of those who don't
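
As one illustration of the first approach, a guideline rule can be encoded as a small function over patient attributes. The sketch below computes the CHA2DS2-VASc stroke-risk score. The `PatientFacts` fields and the `cha2ds2_vasc` helper are hypothetical names for this example, not part of the graph schema, and how each flag would be derived from the graph is left open.

```python
from dataclasses import dataclass


@dataclass
class PatientFacts:
    """Hypothetical container for facts that would be pulled from the graph."""
    age: int
    female: bool
    heart_failure: bool = False
    hypertension: bool = False
    diabetes: bool = False
    prior_stroke_tia_te: bool = False  # stroke, TIA, or thromboembolism
    vascular_disease: bool = False


def cha2ds2_vasc(p: PatientFacts) -> int:
    """Return the CHA2DS2-VASc score for one patient."""
    score = 0
    score += 1 if p.heart_failure else 0        # C
    score += 1 if p.hypertension else 0         # H
    if p.age >= 75:                             # A2
        score += 2
    elif p.age >= 65:                           # A
        score += 1
    score += 1 if p.diabetes else 0             # D
    score += 2 if p.prior_stroke_tia_te else 0  # S2
    score += 1 if p.vascular_disease else 0     # V
    score += 1 if p.female else 0               # Sc
    return score


example = PatientFacts(age=76, female=True, hypertension=True)
print(cha2ds2_vasc(example))
```

The function only computes the score; mapping a score to a treatment recommendation depends on which guideline is adopted and is deliberately not encoded here.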

In [1]:
import pandas as pd
from progressbar import ProgressBar
import time

In [2]:
import getpass
password = getpass.getpass("\nPlease enter the Neo4j database password to continue \n")


Please enter the Neo4j database password to continue 
 ···············


In [3]:
from neo4j import GraphDatabase
driver = GraphDatabase.driver(uri="bolt://localhost:7687", auth=('neo4j', password))
session = driver.session()

In [31]:
# Create a full-text index that can search MIMIC-III ICD9 codes and Problem terms
command = '''
CREATE FULLTEXT INDEX Dx_in_UMLS_or_ICD9 IF NOT EXISTS
FOR (d:D_Icd_Diagnoses|Problem) 
ON EACH [d.long_title, d.description]'''
session.run(command)

<neo4j.work.result.Result at 0x7fb15860be20>

In [4]:
# Find all the patients with stroke based on Problems extracted from notes, excluding hemorrhagic strokes
query = '''
CALL db.index.fulltext.queryNodes("Dx_in_UMLS_or_ICD9", 'stroke NOT hemorrhagic') YIELD node, score
WITH COLLECT(DISTINCT(node.aui)) AS nonhemo_stroke_AUIs
MATCH (pt:Patients)-[:HAD_PROBLEM]->(prob:Problem)
WHERE prob.aui IN nonhemo_stroke_AUIs
RETURN COLLECT(DISTINCT(pt.subject_id)) AS nonhemo_stroke_pts
'''
data = session.run(query)
nonhemo_stroke_pts = data.value()[0]
len(nonhemo_stroke_pts)

144

In [5]:
# Find all the patients with stroke based on ICD9 codes starting with 434
query = '''
MATCH path = (n:D_Icd_Diagnoses)-[:DESCRIBES]->(dx:Diagnoses_Icd)<-[:HAD]-(pt:Patients)
WHERE dx.icd9_code STARTS WITH '434'
RETURN COLLECT(DISTINCT(pt.subject_id)) as ICD9_CVA_pts, COLLECT(DISTINCT(n.long_title)) AS Diagnoses
'''
# The query returns a single record; read both columns from it instead of running it twice
record = session.run(query).single()
Diagnoses = record['Diagnoses']
print("Included Diagnoses:")
for dx in Diagnoses:
    print('- ',dx)

ICD9_CVA_pts = record['ICD9_CVA_pts']
print("Total patients:",len(ICD9_CVA_pts))

Included Diagnoses:
-  Cerebral thrombosis without mention of cerebral infarction
-  Cerebral thrombosis with cerebral infarction
-  Cerebral embolism without mention of cerebral infarction
-  Cerebral embolism with cerebral infarction
-  Cerebral artery occlusion, unspecified without mention of cerebral infarction
-  Cerebral artery occlusion, unspecified with cerebral infarction
Total patients: 1452


In [6]:
# Combine the two stroke patient lists, then convert the result to a set
all_stroke_pts = nonhemo_stroke_pts + ICD9_CVA_pts
print(len(all_stroke_pts))
all_stroke_pts = set(all_stroke_pts) # Remove any duplicates
print(len(all_stroke_pts))

1596
1525
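
The drop from 1596 to 1525 means that 71 patient IDs appeared in both lists. A minimal sketch of the same combine-and-deduplicate step on toy IDs shows how the overlap can be reported directly; the IDs and the `combine_cohorts` helper are made up for illustration.

```python
def combine_cohorts(list_a, list_b):
    """Return (union, intersection) of two lists of patient IDs as sets."""
    set_a, set_b = set(list_a), set(list_b)
    return set_a | set_b, set_a & set_b


# Toy IDs, not real subject_ids
notes_pts = ['101', '102', '103']
icd9_pts = ['103', '104']

combined, overlap = combine_cohorts(notes_pts, icd9_pts)
print(len(notes_pts) + len(icd9_pts), len(combined), len(overlap))
```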


In [4]:
# Find all the patients who received any anticoagulation
query = '''
MATCH path = (c1:Concept {aui: 'A12101446'})-[*..4]-(rx:Prescriptions)<-[:HAD]-(pt:Patients)
RETURN collect(DISTINCT(pt.subject_id)) AS anticoagulated
'''
data = session.run(query)
anticoagulated = data.value()[0]
len(anticoagulated)

In [7]:
len(anticoagulated)

20283

In [79]:
# Find all the patients diagnosed with atrial fibrillation
query = '''
MATCH (dx_i:D_Icd_Diagnoses {short_title: 'Atrial fibrillation'})-[:DESCRIBES]-(dx:Diagnoses_Icd)<-[:HAD]-(pt:Patients)
RETURN collect(DISTINCT(pt.subject_id))
'''
data = session.run(query)
afib_pts = data.value()[0]
print(len(afib_pts))
print(afib_pts[:3])

10271
['1406', '94316', '86763']


In [78]:
# Find all the patients with an ECG reporting atrial fibrillation
query = '''
MATCH (pt:Patients)-[:HAD]->(n:Noteevents {category:'ECG'})
WHERE n.text CONTAINS "trial fib"
RETURN COLLECT(DISTINCT(pt.subject_id)) AS afib_pts_by_ECG
'''
data = session.run(query)
afib_pts_by_ECG = data.value()[0]
print(len(afib_pts_by_ECG))

7448


In [80]:
# Find patients with ECG reporting atrial fibrillation but no ICD diagnosis
AF_but_no_ICD = set(afib_pts_by_ECG) - set(afib_pts)
len(AF_but_no_ICD)

1147

In [None]:
# Placeholder for a thromboembolic stroke query; as written, the query below still matches atrial fibrillation
query = '''
MATCH (dx_i:D_Icd_Diagnoses {short_title: 'Atrial fibrillation'})-[:DESCRIBES]-(dx:Diagnoses_Icd)<-[:HAD]-(pt:Patients)
RETURN collect(DISTINCT(pt.subject_id))
'''
data = session.run(query)
afib_pts = data.value()[0]
print(len(afib_pts))
print(afib_pts[:3])

In [73]:
# Print initial counts
print('Total number of patients with Afib:', len(afib_pts))
print('Total number of patients with stroke:', len(all_stroke_pts))
print('Total number of patients who were anticoagulated:', len(anticoagulated))

# Find all the patients with Afib who did not receive anticoagulation
anticoagulated = set(anticoagulated)
afib_pts = set(afib_pts)
AF_no_AC = afib_pts - anticoagulated
print('Count of patients with Afib who are NOT anticoagulated:',len(AF_no_AC))

# Find all the patients with Afib who received anticoagulation
AF_with_AC = afib_pts - AF_no_AC
print('Count of patients with Afib who ARE anticoagulated:',len(AF_with_AC))

# Find all patients with Afib who did not have stroke
AF_no_stroke = afib_pts - all_stroke_pts
print('Count of patients with Afib who did NOT have stroke:', len(AF_no_stroke))

# Find all patients with Afib who DID have stroke
AF_with_stroke = afib_pts - AF_no_stroke
print('Count of patients with Afib who DID have stroke:', len(AF_with_stroke))

# Find all the patients with Afib and anticoagulation who did NOT have stroke
AF_AC_no_stroke = AF_with_AC - all_stroke_pts
print('Count of patients with Afib and anticoagulation who did NOT have stroke:', len(AF_AC_no_stroke))

# Find all the patients with Afib and anticoagulation who DID have stroke
AF_AC_with_stroke = AF_with_AC - AF_AC_no_stroke
print('Count of patients with Afib and anticoagulation who DID have stroke:', len(AF_AC_with_stroke))

# Find all the patients with Afib and NO anticoagulation who did NOT have stroke
AF_no_AC_no_stroke = AF_no_AC - all_stroke_pts
print('Count of patients with Afib and NO anticoagulation who did NOT have stroke:', len(AF_no_AC_no_stroke))

# Find all the patients with Afib and NO anticoagulation who DID have stroke
AF_no_AC_with_stroke = AF_no_AC - AF_no_AC_no_stroke
print('Count of patients with Afib and NO anticoagulation who DID have stroke:', len(AF_no_AC_with_stroke))

Total number of patients with Afib: 10271
Total number of patients with stroke: 1525
Total number of patients who were anticoagulated: 20283
Count of patients with Afib who are NOT anticoagulated: 3415
Count of patients with Afib who ARE anticoagulated: 6856
Count of patients with Afib who did NOT have stroke: 9625
Count of patients with Afib who DID have stroke: 646
Count of patients with Afib and anticoagulation who did NOT have stroke: 6362
Count of patients with Afib and anticoagulation who DID have stroke: 494
Count of patients with Afib and NO anticoagulation who did NOT have stroke: 3263
Count of patients with Afib and NO anticoagulation who DID have stroke: 152
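
The four subgroup counts above form a 2x2 table of anticoagulation by stroke. The sketch below turns them into crude stroke proportions and a crude odds ratio; the `proportion` helper is a hypothetical name. These figures are unadjusted, and the cohort definitions do not establish whether anticoagulation preceded the stroke, so they describe the data rather than a treatment effect.

```python
# Counts copied from the printed output above
ac_stroke, ac_no_stroke = 494, 6362        # Afib, anticoagulated
no_ac_stroke, no_ac_no_stroke = 152, 3263  # Afib, not anticoagulated


def proportion(events, non_events):
    """Share of a group that had the event."""
    return events / (events + non_events)


p_ac = proportion(ac_stroke, ac_no_stroke)
p_no_ac = proportion(no_ac_stroke, no_ac_no_stroke)

# Crude odds ratio of stroke, anticoagulated vs. not anticoagulated
odds_ratio = (ac_stroke * no_ac_no_stroke) / (ac_no_stroke * no_ac_stroke)

print(f"Stroke among anticoagulated Afib patients:     {p_ac:.1%}")
print(f"Stroke among non-anticoagulated Afib patients: {p_no_ac:.1%}")
print(f"Crude odds ratio: {odds_ratio:.2f}")
```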


In [75]:
# Change groups of patient IDs from set to list types
AF_no_AC = list(AF_no_AC)
AF_with_AC = list(AF_with_AC)
afib_pts = list(afib_pts)
all_stroke_pts = list(all_stroke_pts)
AF_no_stroke = list(AF_no_stroke)
AF_with_stroke = list(AF_with_stroke)
AF_AC_no_stroke = list(AF_AC_no_stroke)
AF_AC_with_stroke = list(AF_AC_with_stroke)
AF_no_AC_no_stroke = list(AF_no_AC_no_stroke)
AF_no_AC_with_stroke = list(AF_no_AC_with_stroke)

In [76]:
AF_no_AC_with_stroke[:3]

['3522', '68245', '30902']

In [8]:
query = '''CREATE INDEX Diagnoses_Icd_subject_ID IF NOT EXISTS FOR (d:Diagnoses_Icd) ON (d.subject_id)'''
session.run(query)

<neo4j.work.result.Result at 0x7fb15d91f670>

In [27]:
# Get counts of all comorbidities for all patients with Afib
# The ID list is passed as a query parameter rather than formatted into the Cypher string
query = '''
MATCH (D_dx:D_Icd_Diagnoses)-[:DESCRIBES]->(dx:Diagnoses_Icd)
WHERE dx.subject_id IN $afib_pts
RETURN D_dx.long_title AS Description, dx.icd9_code AS ICD9_Code, count(dx) as Number
ORDER BY Number DESC
'''
afib_pts_df = session.run(query, afib_pts=afib_pts)
afib_pts_df = pd.DataFrame([dict(record) for record in afib_pts_df])
afib_pts_df

| | Description | ICD9_Code | Number |
|---|---|---|---|
| 0 | Atrial fibrillation | 42731 | 12891 |
| 1 | Unspecified essential hypertension | 4019 | 6591 |
| 2 | Congestive heart failure, unspecified | 4280 | 6568 |
| 3 | Coronary atherosclerosis of native coronary ar... | 41401 | 4897 |
| 4 | Acute kidney failure, unspecified | 5849 | 3390 |
| ... | ... | ... | ... |
| 3898 | Antimalarials and drugs acting on other blood ... | E9314 | 1 |
| 3899 | Ectropion, unspecified | 37410 | 1 |
| 3900 | Striking against or struck accidentally by obj... | E9170 | 1 |
| 3901 | Mechanical failure of instrument or apparatus ... | E8742 | 1 |


In [28]:
# Get counts of all comorbidities for patients with Afib who ARE anticoagulated
# The ID list is passed as a query parameter rather than formatted into the Cypher string
query = '''
MATCH (D_dx:D_Icd_Diagnoses)-[:DESCRIBES]->(dx:Diagnoses_Icd)
WHERE dx.subject_id IN $AF_with_AC
RETURN D_dx.long_title AS Description, dx.icd9_code AS ICD9_Code, count(dx) as Number
ORDER BY Number DESC
'''
AF_with_AC_df = session.run(query, AF_with_AC=AF_with_AC)
AF_with_AC_df = pd.DataFrame([dict(record) for record in AF_with_AC_df])
AF_with_AC_df

| | Description | ICD9_Code | Number |
|---|---|---|---|
| 0 | Atrial fibrillation | 42731 | 9123 |
| 1 | Congestive heart failure, unspecified | 4280 | 5043 |
| 2 | Unspecified essential hypertension | 4019 | 4693 |
| 3 | Coronary atherosclerosis of native coronary ar... | 41401 | 3619 |
| 4 | Other and unspecified hyperlipidemia | 2724 | 2705 |
| ... | ... | ... | ... |
| 3597 | Open wound of tongue and floor of mouth, witho... | 87364 | 1 |
| 3598 | Vascular disorders of penis | 60782 | 1 |
| 3599 | Other complications due to unspecified device,... | 99670 | 1 |
| 3600 | Need for prophylactic vaccination and inoculat... | V066 | 1 |


In [29]:
# Get all patients with thromboembolic stroke




In [30]:
# Share of each diagnosis among all diagnosis records for Afib patients (the reference group)
gen_pop_total = sum(afib_pts_df['Number'])
afib_pts_df['Gen_pop_proportion'] = afib_pts_df['Number']/gen_pop_total

# Keep only diagnoses recorded more than 50 times
afib_pts_df = afib_pts_df[afib_pts_df['Number'] > 50]

# Share of each diagnosis among all diagnosis records for anticoagulated Afib patients
comorb_total = sum(AF_with_AC_df['Number'])
AF_with_AC_df['Comorbidities_proportion'] = AF_with_AC_df['Number']/comorb_total

AF_with_AC_df = AF_with_AC_df[AF_with_AC_df['Number'] > 50]

# Merge the "Gen_pop_proportion" column from afib_pts_df into AF_with_AC_df
AF_with_AC_df = pd.merge(AF_with_AC_df, afib_pts_df, on=['ICD9_Code', 'Description'])

# Note: 'Odds_Ratio' is the ratio of the two proportions above, not a true odds ratio
AF_with_AC_df['Odds_Ratio'] = (AF_with_AC_df['Comorbidities_proportion']/AF_with_AC_df['Gen_pop_proportion'])
AF_with_AC_df.sort_values(by='Odds_Ratio', ascending=False, inplace=True)

AF_with_AC_df.loc[:,['Description', 'Odds_Ratio']].head(20)

| | Description | Odds_Ratio |
|---|---|---|
| 152 | Ventilator associated pneumonia | 1.254416 |
| 126 | Renal dialysis status | 1.249336 |
| 323 | Other and unspecified Escherichia coli [E. coli] | 1.247693 |
| 358 | Personal history of Methicillin resistant Stap... | 1.243835 |
| 385 | Other acquired absence of organ | 1.239643 |
| 384 | Acquired absence of intestine (large) (small) | 1.239643 |
| 411 | Other specified forms of effusion, except tube... | 1.236019 |
| 300 | Acute edema of lung, unspecified | 1.235754 |
| 427 | Personal history of pulmonary embolism | 1.234384 |
| 432 | Chronic kidney disease, Stage II (mild) | 1.233522 |


## Get the relevant literature

In [49]:
from progressbar import ProgressBar
import requests
from bs4 import BeautifulSoup
import json
import re
import urllib.parse
import pandas as pd
import time
from datetime import datetime

In [2]:
# Define a function that takes a term, searches PubMed for that term, and returns a list of the 
# PMIDs of the articles found
def find_pmid_list_for(term, max_result_count=1000):
    esearch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
    # Pass the query as params so that requests URL-encodes terms containing spaces
    params = {'db': 'pubmed', 'retmax': max_result_count, 'term': term}
    response = requests.get(esearch_url, params=params)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    try:
        # The IDs come back newline-separated; turn them into a list of strings
        ids_str = soup.idlist.get_text()
        ids_str = ids_str.replace('\n',',')
        ids_str = ids_str[1:-1] 
        ids_str = ids_str.split(',')
        return ids_str
    
    except AttributeError:
        # soup.idlist is None when the response has no <IdList> element
        return []
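
An alternative to slicing the concatenated ID text is to read each `<Id>` element individually, which avoids producing blank entries when a search returns nothing. The sketch below does this with the standard-library XML parser on a hand-written sample shaped like an esearch response; `parse_pmids` and the sample PMIDs are hypothetical and are not used elsewhere in the notebook.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an esearch XML response (not real output)
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""


def parse_pmids(xml_text):
    """Return the text of every <Id> element, skipping empty ones."""
    root = ET.fromstring(xml_text)
    return [id_el.text for id_el in root.iter('Id') if id_el.text]


print(parse_pmids(SAMPLE_ESEARCH_XML))
print(parse_pmids("<eSearchResult><IdList/></eSearchResult>"))
```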

In [3]:
# Identify search terms of interest
pubmed_query_list = ['atrial fibrillation', 'ischemic stroke', 'cardioembolic stroke', 'anticoagulation', 'anticoagulant', 'bleeding risk', 'hemorrhage', 'cardioversion','cha2ds2-vasc','stroke risk','rhythm control','rate control']

In [96]:
# Perform a pubmed search for each term that returns at most 1000 articles per query and add them to a set of PMIDs
PMIDs_list = set()
pbar = ProgressBar()
for term in pbar(pubmed_query_list):
    PMIDs_list.update(find_pmid_list_for(term, max_result_count=1000))
print("Number of items in list: ",len(PMIDs_list))

PMIDs_list.discard('') # remove the blank item if present (sets are unordered, so it is not reliably first)
PMIDs_list = list(PMIDs_list) # change PMIDs_list from a set to a list
PMIDs_list[:3] # check to be sure the blank item was removed properly

100% |########################################################################|

Number of items in list:  9755





['33833839', '34063361', '33549807']

In [94]:
# Fetch a batch of articles
def get_articles(PMID_batch):
    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    # efetch expects a comma-separated string of IDs, not the repr of a Python list
    params = {'db': 'pubmed', 'id': ','.join(PMID_batch), 'retmode': 'xml', 'rettype': 'abstract'}
    response = requests.get(url, params=params)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    return soup

def causality_search(pmid):
    # Relies on the global `soup` holding the most recently fetched batch of articles
    
    # Each pattern captures a whole sentence that contains the keyword
    regex_cause = r"([^.\n]*?[^-]caus*[^.]*\.[^0-9])"
    regex_due = r"([^.\n]*?[^-]due to[^.]*\.[^0-9])"
    regex_result = r"([^.\n]*?[^-]result*[^.]*\.[^0-9])"
    
    sentence_list = []
    
    article = soup.find(text=pmid)
    try:
        title = article.find_next('articletitle').get_text()
    except AttributeError:
        title = ""
    try:
        abstract = article.find_next('abstracttext').get_text()
    except AttributeError:
        abstract = ""
    title_abstract = title.upper()+' '+abstract
    if len(title+abstract) > 0:
        sentence_list+=re.findall(regex_cause, title_abstract)
        sentence_list+=re.findall(regex_due, title_abstract)
        sentence_list+=re.findall(regex_result, title_abstract)
    # Return a row only when at least one sentence matched; otherwise return None
    if sentence_list:
        data = ['causes', pmid, sentence_list, title_abstract]
        return data
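
To see what the sentence patterns capture, the sketch below applies `regex_due`, copied from `causality_search`, to two made-up strings. It also shows one property of the pattern: because it ends in `\.[^0-9]`, a sentence is matched only if some character follows its period, so a sentence at the very end of a string is skipped.

```python
import re

# Pattern copied from causality_search above
regex_due = r"([^.\n]*?[^-]due to[^.]*\.[^0-9])"

# Made-up text for illustration
text = ("Stroke is common. Cerebral embolism due to infective endocarditis is serious. "
        "Outcomes vary.")
matches = re.findall(regex_due, text)
print([m.strip() for m in matches])

# A sentence that ends the string has no character after its period
tail = "Embolism was due to a clot."
print(re.findall(regex_due, tail))        # no match
print(re.findall(regex_due, tail + " "))  # matches once a trailing space is added
```

If the final sentence of each abstract should also be searched, appending a space to the text before matching is one way to do it.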

In [97]:
data = []
batch_size = 200
for i in range(0, len(PMIDs_list), batch_size):
    
    # Get a batch of up to 200 PMIDs (slicing past the end of the list is safe)
    PMID_batch = PMIDs_list[i:i + batch_size]
    
    # Fetch the articles for the batch of PMIDs
    soup = get_articles(PMID_batch)
    
    # For each article, do a causality search and return the data extracted for the article
    for pmid in PMID_batch:
        results = causality_search(pmid)
        if results is not None:
            data.append(results)
    
df = pd.DataFrame(data, columns=['term', 'pmid', 'focused_sentences', 'text'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2646 entries, 0 to 2645
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   term               2646 non-null   object
 1   pmid               2646 non-null   object
 2   focused_sentences  2646 non-null   object
 3   text               2646 non-null   object
dtypes: object(4)
memory usage: 82.8+ KB
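
The batching in the loop above amounts to slicing the PMID list into consecutive chunks of at most `batch_size` items. A small generator, here called `chunked` (a hypothetical helper, not part of the notebook), isolates that logic so it can be checked without calling PubMed:

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs, not real PMIDs
pmids = [str(n) for n in range(450)]
batches = list(chunked(pmids, 200))
print([len(b) for b in batches])
```

The last batch is simply shorter, so no special case is needed for the end of the list.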


In [98]:
df = df.explode('focused_sentences', ignore_index=True)
df.head()

| | term | pmid | focused_sentences | text |
|---|---|---|---|---|
| 0 | causes | 33833839 | Patient and Methods: An 85-year-old man was t... | A CASE OF ACUTE CEREBRAL INFARCTION WITH A FAV... |
| 1 | causes | 34063361 | Cerebral embolism due to infective endocardit... | IMPACT OF OPERATIVE TIMING IN INFECTIVE ENDOCA... |
| 2 | causes | 34621588 | An elongated styloid process is known to caus... | RECURRENCE OF INTERNAL CAROTID ARTERY DISSECTI... |
| 3 | causes | 34621588 | Previous reports claim that internal carotid ... | RECURRENCE OF INTERNAL CAROTID ARTERY DISSECTI... |
| 4 | causes | 34594298 | Background: Mechanical thrombectomy (MT) has ... | INITIAL EXPERIENCE PERFORMING MECHANICAL THROM... |


In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3853 entries, 0 to 3852
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   term               3853 non-null   object
 1   pmid               3853 non-null   object
 2   focused_sentences  3853 non-null   object
 3   text               3853 non-null   object
dtypes: object(4)
memory usage: 120.5+ KB


In [100]:
df.to_csv('Afib_fine-tuning_dataset.csv', index=False)

Use [Annotation_Tool.ipynb](Annotation_Tool.ipynb) to extract relationships from Afib_fine-tuning_dataset.csv and further fine-tune the GPT-3 model.