# Lean Prototype

Any graph should be modeled to suit the uses for which it was created, as far as they can be known. A knowledge graph is no different. To prepare training data for a transformer language model that extracts knowledge from free text, we must have a model of our knowledge graph in mind. We must demonstrate consistent patterns for the language model to follow, and its output must fit back into a useful graph. Here we start at the end of that process and develop some simple rules of reasoning that use a knowledge graph to perform certain cognitive tasks. We assume that we can model our knowledge graph in whatever way best suits our reasoning framework. Let us start with a few use cases.  
  
1. Given an observed association between a drug and a disease, generate an explanation for the association and hypothesize whether the drug causes the disease, treats it, or does neither. (First Databank (FDB), used for EMR alerts, reports mild, moderate, or severe interactions but lacks causal direction and detail: e.g., the MMR-RhoGAM interaction is listed as severe, but you should increase the dose of MMR, not omit it.)
2. Predict what other problems a patient is likely to experience given their current problem list, recommend how to prevent those comorbidities, and explain the predictions and recommendations. 
3. Show any close relationships among a given patient's problems at admission, highlight any causal chains among them, and identify any parts of the causal chain the patient is likely to have which are not yet reported in the EHR. Suggest a workup for these missing problems and what you would do to treat them. Check to see how often the discharge summary lists problems you identified as likely to be missing from the admission note. 
4. Show the typical course of any disease using both timedeltas and causal chains. Show where a given patient is in the course of the disease. If there are subgroups of patients who have different courses for the same disease, show which subgroup a given patient is in and what path (if any) leads from their current subgroup to a subgroup with better outcomes. 
5. For a patient with a given set of diseases, show how to maximize the timedelta between each of their problems and death. 
6. For drug-disease pairs that share an association in MIMIC-III but no known `treats` relationship in MED-RT, hypothesize why they are associated. In cases where the drug is hypothesized to cause the disease, identify any patients with instances of both the disease and the drug, and present the evidence for causation for review by a pharmacist or the ordering physician.
7. Given a specific patient with a new diagnosis of atrial fibrillation, decide whether or not to anticoagulate and explain the recommendation. 
  
Lean setup for a use-case:
1. Generate a comprehensive knowledge graph for MIMIC-III.  
    A. Search PubMed for articles about all of the problems, prescriptions, and abnormal labs found in MIMIC-III.  
    B. Run all the articles through GPT-3 to extract entities and relationships.  
    C. Merge the entities and relationships into a graph that is pre-loaded with UMLS, MED-RT, RxNorm, and MeSH.  
    D. Merge all entities in the patient data as instances of entities in the knowledge graph.   
2. Develop a reasoning framework that can be used to fulfill use cases.  
3. Iterate steps 1 and 2 to revise the knowledge model and reasoning framework until the use cases can be completed.  

## 1. Generate a comprehensive knowledge graph for MIMIC-III.
### 1-A. Search PubMed for articles about all of the problems, prescriptions, and abnormal labs found in MIMIC-III.

In [82]:
import requests
from bs4 import BeautifulSoup
import json
import urllib.parse
from datetime import datetime
from progressbar import ProgressBar
import pickle

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
import time
import re
import pprint
import numpy as np
import multiprocessing
from multiprocessing import  Pool

# Running this cell will ask you to provide the database password. If you need the password, email Tim McLerran at tmclerran@gmail.com with evidence
# that you have been granted access to MIMIC-III data by PhysioNet. If you are unsure how to proceed, just ask the #nlp channel on Slack
import getpass
password = getpass.getpass("\nPlease enter the Neo4j database password to continue \n")

# Create a connection to the working group's Neo4j database of MIMIC-III data
from neo4j import GraphDatabase
driver=GraphDatabase.driver(uri="bolt://76.251.77.235:7687", auth=('neo4j',password))
session=driver.session()


Please enter the Neo4j database password to continue 
 ·······


In [21]:
# Search PubMed for articles about all of the problems, prescriptions, and abnormal labs found in MIMIC-III.

# Define a function that runs a Cypher query and returns the first column of its first record.
# Every query below wraps its result in collect(), so that value is the complete list.
def get_list(query):
    result = session.run(query)
    entire_result = [record[0] for record in result]
    return entire_result[0]
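
`get_list` returns only the first record, which suits queries that aggregate with `collect()`. For a query that returns one value per record, a variant along the following lines might be useful. This is a minimal sketch: `get_column` and `FakeSession` are hypothetical names that do not appear in the notebook, and the fake session only stands in for the Neo4j session so that the example runs without a database.

```python
def get_column(session, query):
    """Return the first column of every record produced by `query`."""
    return [record[0] for record in session.run(query)]


class FakeSession:
    """Hypothetical stand-in for a Neo4j session, used only for this demo."""

    def __init__(self, rows):
        self._rows = rows

    def run(self, query):
        # A real session yields records that support positional indexing;
        # plain tuples behave the same way for this purpose.
        return iter(self._rows)


demo_session = FakeSession([("Sepsis",), ("Atrial fibrillation",)])
demo_values = get_column(
    demo_session, "MATCH (prob:Problem) RETURN DISTINCT prob.description"
)
print(demo_values)
```

Passing the session as an argument, rather than reading the global `session`, also makes the helper easier to exercise in isolation.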

In [24]:
# Generate a list of all unique problems found in MIMIC-III:
query = 'MATCH (prob:Problem) RETURN collect(DISTINCT(prob.description))'
problems = get_list(query)
print("Number of items in list: ",len(problems))

Number of items in list:  3250


In [26]:
# Generate a list of all unique prescriptions found in MIMIC-III:
query = 'MATCH (rx:Prescriptions) RETURN collect(DISTINCT(rx.DRUG))'
drugs = get_list(query)
print("Number of items in list: ",len(drugs))

Number of items in list:  4526


In [28]:
# Generate a list of all unique labs which are flagged as abnormal at least once in MIMIC-III:
query = '''
MATCH (lab:Labevents)
WHERE lab.FLAG = "abnormal"
WITH collect(DISTINCT(lab.ITEMID)) AS abnormals
MATCH (description:D_Labitems)
WHERE description.ITEMID IN abnormals
RETURN collect(description.LABEL)'''

labs = get_list(query)
print("Number of items in list: ",len(labs))

Number of items in list:  305


In [40]:
# Combine the lists for problems, prescriptions, and abnormal labs
pubmed_query_list = problems + drugs + labs
print("Number of items in list: ",len(pubmed_query_list))

Number of items in list:  9357
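
Before sending the combined list to PubMed, it may be worth removing empty entries and duplicates that differ only in case or spacing, since each one costs a separate search. The sketch below assumes such entries can occur in the lists; that has not been verified against the MIMIC-III data. `clean_terms` is a hypothetical helper, and applying it would change the item counts reported in this notebook.

```python
def clean_terms(terms):
    """Drop empty entries and case-insensitive duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for term in terms:
        if term is None:
            continue
        text = " ".join(str(term).split())  # trim and collapse runs of whitespace
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


demo_terms = clean_terms(["Heparin", "heparin ", None, "", "Potassium  Chloride"])
print(demo_terms)
```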


In [55]:
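# Define a function that takes a term, searches PubMed for that term, and returns a list of the 
# PMIDs of the articles found
def find_pmid_list_for(term, max_result_count=1000):
    # URL-encode the term so that characters such as spaces, '&', and '#' cannot corrupt the query string
    esearch_query_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax={retmax}&term={term}'.format(
        retmax=max_result_count, term=urllib.parse.quote_plus(str(term)))
    response = requests.get(esearch_query_url)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    try:
        # The text of <IdList> is newline-delimited: "\n<id>\n<id>\n"
        ids_str = soup.idlist.get_text()
        ids_str = ids_str.replace('\n',',')
        ids_str = ids_str[1:-1]  # drop the leading and trailing commas
        ids_str = ids_str.split(',')
        return ids_str  # an empty <IdList> yields [''], which is removed after the search loop
    
    except AttributeError:
        # The response contained no <IdList> element
        return []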
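
The string slicing above depends on the exact whitespace inside `<IdList>`. As an alternative, the response can be parsed as XML with the standard library. The following is a minimal sketch, not the method used in this notebook: `parse_esearch_ids` is a hypothetical helper, and the two sample responses are hand-written to mimic the shape of an esearch result.

```python
import xml.etree.ElementTree as ET


def parse_esearch_ids(xml_text):
    """Return the PMIDs listed in an esearch XML response as a list of strings."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Hand-written samples shaped like esearch responses (not real API output)
sample_response = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>25913168</Id>
    <Id>32671735</Id>
  </IdList>
</eSearchResult>"""

empty_response = "<eSearchResult><Count>0</Count><IdList/></eSearchResult>"

print(parse_esearch_ids(sample_response))
print(parse_esearch_ids(empty_response))
```

With a live request, `response.content` would be passed in place of the sample string. An empty result then produces an empty list rather than a list holding one blank string.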

In [66]:
# Perform a PubMed search for each term, returning at most 1000 articles per query, and add the PMIDs to a set
pmid_list = set()
pbar = ProgressBar()
for term in pbar(pubmed_query_list):
    pmid_list.update(find_pmid_list_for(term, max_result_count=1000))
print("Number of items in list: ",len(pmid_list))

100% |########################################################################|

Number of items in list:  1044253
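
The loop above issues several thousand requests back to back. NCBI asks E-utilities clients without an API key to stay at or below about three requests per second, so a small throttle may help avoid rejected requests. This is a sketch under that assumption: `MinIntervalThrottle` is a hypothetical class, and the 0.34 second interval is an illustrative choice.

```python
import time


class MinIntervalThrottle:
    """Block so that successive wait() calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = MinIntervalThrottle(0.34)  # roughly three requests per second

# Intended use inside the search loop (left commented out; it needs network access):
# for term in pubmed_query_list:
#     throttle.wait()
#     pmid_list.update(find_pmid_list_for(term, max_result_count=1000))
```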





In [81]:
pmid_list = list(pmid_list) # change pmid_list from a set to a list
pmid_list = [pmid for pmid in pmid_list if pmid] # remove the blank item; set order is not guaranteed, so do not rely on its position
pmid_list[:3] # check to be sure the blank item was removed properly

['25913168', '32671735', '34022190']

In [83]:
# Save the pmid_list
with open('MIMIC_pmid_list', 'wb') as fp:
    pickle.dump(pmid_list, fp)

In [84]:
# Retrieve the pmid_list from disk and check to be sure it still looks right
with open('MIMIC_pmid_list', 'rb') as fp:
    saved_pmid_list = pickle.load(fp)

print("Number of items in list: ",len(saved_pmid_list))
saved_pmid_list[:3]

Number of items in list:  1044252


['25913168', '32671735', '34022190']
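
Because the PMIDs are plain strings, they could also be stored as a text file with one PMID per line. Such a file can be inspected by hand and read from other tools, and it avoids unpickling a file from an untrusted source. The sketch below is one possible approach: `save_pmids`, `load_pmids`, and the `.txt` file name are hypothetical.

```python
import os
import tempfile


def save_pmids(pmids, path):
    """Write one PMID per line."""
    with open(path, "w", encoding="ascii") as handle:
        for pmid in pmids:
            handle.write(f"{pmid}\n")


def load_pmids(path):
    """Read PMIDs back, skipping any blank lines."""
    with open(path, encoding="ascii") as handle:
        return [line.strip() for line in handle if line.strip()]


demo_pmids = ["25913168", "32671735", "34022190"]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "MIMIC_pmid_list.txt")
    save_pmids(demo_pmids, demo_path)
    restored_pmids = load_pmids(demo_path)

print(restored_pmids == demo_pmids)
```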

To-Do:


### 1-B. Run all the articles through GPT-3 to extract entities and CAUSES/SYNONYM relationships.
Define a set of functions that:
- take a batch of PubMed PMIDs
- extract the articles
- select the sentences that contain causal statements
- use the GPT-3 API to extract causal relationships
- store the causal relationships in a dataframe along with the PMID and the article's level of evidence
- merge the entities into the graph as Concept nodes, matching the text of each GPT-3-extracted concept against the UMLS Concept "term" property
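
The batching and sentence-selection steps in the list above can be sketched without any external service. Everything in this block is an assumption made for illustration: the cue-word pattern, the function names, the row layout, and the demo text are all hypothetical. The GPT-3 call is represented by an `extract_fn` argument (here a toy regex extractor), because the prompt and API usage are not defined in this notebook. Article retrieval and level of evidence are left out.

```python
import re

# Hypothetical cue words for spotting causal statements; a real list would need tuning
CAUSAL_CUES = re.compile(
    r"\b(causes?|caused by|leads? to|results? in|induces?|due to)\b", re.IGNORECASE
)


def batches(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def causal_sentences(text):
    """Split text naively on sentence punctuation and keep sentences with a causal cue."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if CAUSAL_CUES.search(s)]


def extract_rows(articles, extract_fn):
    """Build one row per (cause, effect) pair.

    articles: dict mapping an article id to its text
    extract_fn: callable taking a sentence and returning a list of (cause, effect) tuples
    """
    rows = []
    for article_id, text in articles.items():
        for sentence in causal_sentences(text):
            for cause, effect in extract_fn(sentence):
                rows.append({"pmid": article_id, "cause": cause,
                             "effect": effect, "sentence": sentence})
    return rows


def toy_extractor(sentence):
    """Stand-in for the language-model call; handles only 'X causes Y.' patterns."""
    match = re.search(r"(.+?) (?:causes|leads to) (.+?)\.", sentence)
    return [(match.group(1), match.group(2))] if match else []


# Made-up demo text with a placeholder id, not a real article
demo_articles = {"demo-1": "Smoking causes lung cancer. The cohort had 120 patients."}
demo_rows = extract_rows(demo_articles, toy_extractor)
print(demo_rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(demo_rows)` to obtain the dataframe described in the list.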

## Reasoning Framework
See [Reasoning_Framework.ipynb](Reasoning_Framework.ipynb)

### Present plans to the patient and update aversion scores based on their acceptance/rejection. 