# Version 1
Strategy: Load all UMLS strings into memory and perform custom-built search for UMLS concepts  
Performance: ~40 seconds to extract all problems from a single patient's notes, inappropriately extracts some concepts

## Instructions
Given a Jupyter notebook with code to create a Python list of strings containing the Assessment and Plan portion of physicians' notes, extract the patient's problems. At this early step, don't worry if your code also extracts "housekeeping" items that are sometimes listed at the very end of the note.  

#### Expected output:
A list of problems that each patient has.

You can use a rules-based approach, medspacy, medcat, metamap, UMLS, bioportal annotator, or anything else you think will get the job done.

## Iteration-independent code

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
import time
import re
import pprint
import numpy as np
import multiprocessing
from multiprocessing import  Pool

In [2]:
# Running this cell will ask you to provide the database password. If you need the password, email Tim McLerran at tmclerran@gmail.com to provide evidence 
# that you have been granted access to MIMIC-III data by physionet. If you are unsure how to proceed, just ask the #nlp channel on Slack
import getpass
password = getpass.getpass("\nPlease enter the Neo4j database password to continue \n")


Please enter the Neo4j database password to continue 
 ···············


In [3]:
# Create a connection to the working group's Neo4j database of MIMIC-III data
from neo4j import GraphDatabase
driver=GraphDatabase.driver(uri="bolt://localhost:7687", auth=('neo4j',password))
session=driver.session()

In [5]:
query = '''
MATCH (n:Noteevents)
WHERE n.CATEGORY = "Physician "
RETURN DISTINCT(n.SUBJECT_ID) as subject_id
ORDER BY subject_id'''
patients = session.run(query)
patients = pd.DataFrame([dict(record) for record in patients])
print(patients.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7623 entries, 0 to 7622
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   subject_id  7623 non-null   object
dtypes: object(1)
memory usage: 59.7+ KB
None


## Iterate over patients to create CSV files to batch import problems

In [None]:
# Cypher query to obtain all physician notes for a patient with a given subject_id
query = '''
MATCH (n:Noteevents)
WHERE n.CATEGORY = "Physician "
RETURN DISTINCT(n.TEXT) AS note, n.STORETIME as storetime, n.SUBJECT_ID as subject_id, n.ROW_ID as note_id
ORDER BY subject_id, storetime'''

In [7]:
def get_notes(query):

    # Execute the cypher query to obtain physician notes in a Neo4j object
    data = session.run(query)

    # Convert the neo4j object into a dataframe
    df = pd.DataFrame([dict(record) for record in data])

    # Define a function that takes a full physician's note and returns only the Assessment and Plan portion
    def isolate_AP(note):
        # Define a regex pattern to isolate the Assessment and Plan portion of each note
        # Try the 'ICU Care' end marker first, then fall back to 'ICU [**'
        for end_marker in (r'ICU Care', r'ICU \[\*\*'):
            pattern = re.compile(r'(Assessment and Plan.*)' + end_marker, re.DOTALL)
            match = re.search(pattern, note)
            if match is not None:
                return match.group(1)
        return 'No AP isolated'

    # Apply the function that isolates AP to the column of the dataframe containing notes
    df['note'] = df['note'].apply(isolate_AP)

    # Check the output
    return df
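
To see how the two patterns in `isolate_AP` behave, here is a small self-contained sketch run on made-up note text (the sample strings are illustrative, not MIMIC-III data, and `isolate_ap_sketch` is a hypothetical name). Because `.*` is greedy and `re.DOTALL` lets it cross newlines, the capture runs up to the last occurrence of the end marker.

```python
import re

# Same two patterns as isolate_AP, compiled once
PRIMARY = re.compile(r'(Assessment and Plan.*)ICU Care', re.DOTALL)
FALLBACK = re.compile(r'(Assessment and Plan.*)ICU \[\*\*', re.DOTALL)

def isolate_ap_sketch(note):
    """Return the Assessment and Plan text, or 'No AP isolated'."""
    for pattern in (PRIMARY, FALLBACK):
        match = pattern.search(note)
        if match is not None:
            return match.group(1)
    return 'No AP isolated'

# Made-up examples for illustration only
note_a = "HPI: ...\nAssessment and Plan\n# hypotension - IVF\nICU Care\nNutrition: NPO"
note_b = "Assessment and Plan\n# anemia - transfuse\nICU [**Hospital 1**] Care"
note_c = "Brief note with no plan section"

print(repr(isolate_ap_sketch(note_a)))  # matched by the primary pattern
print(repr(isolate_ap_sketch(note_b)))  # matched by the fallback pattern
print(isolate_ap_sketch(note_c))        # neither pattern matches
```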

In [8]:
start_time = time.time()
df = get_notes(query)
print("Total runtime:", time.time() - start_time, "seconds")
df.info()

Total runtime: 11.331737995147705 seconds
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141624 entries, 0 to 141623
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   note        141624 non-null  object
 1   storetime   141624 non-null  object
 2   subject_id  141624 non-null  object
 3   note_id     141624 non-null  object
dtypes: object(4)
memory usage: 4.3+ MB


In [9]:
df[df['note'] == 'No AP isolated']

Unnamed: 0,note,storetime,subject_id,note_id
175,No AP isolated,2129-12-31 23:20:53,10428,363396
176,No AP isolated,2130-01-01 01:24:42,10428,363403
199,No AP isolated,2165-05-06 01:57:11,10502,405218
213,No AP isolated,2166-05-23 02:34:57,10686,406408
242,No AP isolated,2140-10-12 10:48:07,10774,345108
...,...,...,...,...
141425,No AP isolated,2153-04-20 12:03:37,99912,524081
141438,No AP isolated,2153-04-22 08:06:06,99912,524592
141442,No AP isolated,2153-04-22 10:47:21,99912,524631
141517,No AP isolated,2161-02-17 13:59:34,99944,627286


To-do:
- Remake str_to_CUI as str_to_AUI using MRCONSO_for_import
- change CUI to AUI in all the code in this notebook
- recreate the problems_from_notes file
- re-import problems

In [10]:
# Load the table of all strings in the UMLS with their associated CUI and semantic type
str_to_CUI = pd.read_csv('/home/tim/Documents/GrApH_AI/Data/str_to_CUI.csv', encoding='utf-8')
str_to_CUI.dropna(inplace=True)
str_to_CUI.tail()

Unnamed: 0,STR,CUI,STY
14305583,ﾜﾝﾌｶｲｶﾝ,C0877610,Sign or Symptom
14305584,ﾜﾝﾍﾝｹｲ,C0919717,Anatomical Abnormality
14305585,ﾜﾝﾍﾝｹｲNOS,C0919717,Anatomical Abnormality
14305586,ﾜﾝﾎｳｿｳｴﾝ,C0562422,Disease or Syndrome
14305587,ﾜﾝﾚｯｼｮｳ,C0432974,Injury or Poisoning


In [11]:
# Define a function that conducts a fast binary search on a sorted column of a dataframe, returning only full match results.

def binary_search(dataframe, column, target):
    range_start = 0
    range_end = len(dataframe)-1
    while range_start < range_end:
        range_middle = (range_end + range_start) // 2
        value = dataframe.iloc[range_middle][column]
        if value == target:
            return dataframe.iloc[range_middle]
        elif value < target:
            # Discard the first half of the range
            range_start = range_middle + 1
        else:
            # Discard the second half of the range
            range_end = range_middle - 1
    # The loop ends with range_start >= range_end; check the one remaining candidate
    value = dataframe.iloc[range_start][column]
    if value == target:
        return dataframe.iloc[range_start]
    else:
        return 0
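
The same exact-match lookup can also be expressed with Python's standard `bisect` module. The sketch below is an alternative, not the notebook's code: it assumes the strings have been pulled into a plain sorted list (the hypothetical `sorted_strings`) and returns an index or `None`. Like `binary_search`, it only works if the data is sorted with the same ordering that the `<` operator uses.

```python
from bisect import bisect_left

def find_exact(sorted_strings, target):
    """Return the index of target in sorted_strings, or None if absent."""
    i = bisect_left(sorted_strings, target)
    if i < len(sorted_strings) and sorted_strings[i] == target:
        return i
    return None

# Toy data; sorted() uses the same code-point ordering as the < operator,
# so uppercase letters sort before lowercase ones
sorted_strings = sorted(["anemia", "Anemia", "hypotension", "sepsis"])

print(find_exact(sorted_strings, "sepsis"))
print(find_exact(sorted_strings, "fever"))
```

Note that code-point ordering places "Anemia" before "anemia", which is why the lookup is case-sensitive and the notebook retries with several case variants.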

In [12]:
# Define a function which takes a string and returns the problem identified at the start of the string

def text_to_CUIs(text):
    
    # Replace runs of non-alphanumeric characters with spaces (Unicode-aware), then split the text into a list of words
    text = re.sub(r'[\W_]+', ' ', text, flags=re.UNICODE)
    text = text.strip()
    text = text.split(' ')
    
    cui = None
    
    # Iterate through the beginning of the list of words to find the largest set of consecutive words that match CUI-associated strings
    
    # Semantic types that count as patient problems
    problem_types = ['Disease or Syndrome', 'Pathologic Function', 'Neoplastic Process', 'Sign or Symptom', 'Injury or Poisoning', 'Mental or Behavioral Dysfunction']

    for i in reversed(range(1,7)):
        term = ' '.join(text[:i])

        # Try the term as written, then lowercase, uppercase, and title case; stop at the first match
        for variant in (term, term.lower(), term.upper(), term.title()):
            frame = binary_search(dataframe = str_to_CUI, column = 'STR', target = variant)
            if type(frame) != int:
                break

        if type(frame) == int:
            # No variant matched, so try a shorter phrase
            continue
        if frame['STY'] in problem_types:
            return frame['CUI']
        if variant == term:
            # An exact-case match with a non-problem semantic type ends the search
            break
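
The longest-prefix matching idea behind `text_to_CUIs` can be illustrated on a toy scale. The sketch below is a simplified stand-in, not the notebook's implementation: it swaps the binary search over `str_to_CUI` for a small dictionary (the hypothetical `TOY_TABLE`), uses placeholder identifiers rather than real CUIs, and keeps searching shorter phrases whenever a match has a non-problem semantic type.

```python
import re

# Hypothetical miniature lookup table: string -> (CUI, semantic type).
# The CUIs are placeholders, not real UMLS identifiers.
TOY_TABLE = {
    "acute renal failure": ("CUI_A", "Disease or Syndrome"),
    "renal": ("CUI_B", "Body Part, Organ, or Organ Component"),
    "Hypotension": ("CUI_C", "Sign or Symptom"),
}
PROBLEM_TYPES = {"Disease or Syndrome", "Sign or Symptom"}

def longest_prefix_cui(line, max_words=6):
    """Return the CUI of the longest leading phrase that names a problem."""
    words = re.sub(r'[\W_]+', ' ', line).strip().split(' ')
    for n in range(max_words, 0, -1):  # try 6 words, then 5, ... then 1
        term = ' '.join(words[:n])
        for variant in (term, term.lower(), term.upper(), term.title()):
            entry = TOY_TABLE.get(variant)
            if entry is not None and entry[1] in PROBLEM_TYPES:
                return entry[0]
    return None

print(longest_prefix_cui("# Acute renal failure - likely prerenal"))
print(longest_prefix_cui("hypotension, resolved"))
print(longest_prefix_cui("renal consult pending"))
```

The first line matches through its lowercase variant, the second through its title-case variant, and the third returns `None` because "renal" is in the table but is not a problem type.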

In [14]:
# Define a function that accepts an Assessment and Plan and extracts the problems listed
def AP_to_problems(AP):
    problem_set = set([])
    if AP != 'No AP isolated':
        rows_list = AP.split('\n')
        for row in rows_list:
            cui = text_to_CUIs(row)
            if cui != None:
                problem_set.add(cui)
        return problem_set
    else:
        return problem_set
        
# Define a function that applies the AP_to_problems function to the note column of a 
# dataframe and adds the result as a new "problem" column to the dataframe
def add_problems_col(df):
    df['problems'] = df['note'].apply(AP_to_problems)
    return df

# Define a function that uses parallel processing to perform a given function 
# on a given dataframe
def parallelize_dataframe(df, func):
#     n_cores = multiprocessing.cpu_count()
    n_cores = 11
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

start_time = time.time()

# Run the function that adds a "problems" column to the dataframe, using the 
# parallel processing function
df = parallelize_dataframe(df, add_problems_col)

print("Total runtime:", time.time() - start_time, "seconds")

Total runtime: 39277.384462594986 seconds


|Number of cores|Seconds to extract AP from 1000 notes|  
|---|---|  
|8|322|  
|10|304|  
|11|295|  
|12|302|    
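
As a rough consistency check, the per-1000-note timings in the table can be extrapolated to the full set of 141,624 notes and compared with the measured runtime of the full run. This is simple arithmetic on the numbers reported above; it assumes throughput stays constant across the whole dataset, which may not hold exactly.

```python
# Timings from the table above: cores -> seconds per 1000 notes
seconds_per_1000 = {8: 322, 10: 304, 11: 295, 12: 302}
total_notes = 141624          # rows in df
observed_seconds = 39277.38   # full run reported above (rounded)

# Pick the fastest configuration and project it onto the full dataset
best_cores = min(seconds_per_1000, key=seconds_per_1000.get)
projected = seconds_per_1000[best_cores] / 1000 * total_notes

print("Fastest configuration:", best_cores, "cores")
print("Projected full run:", round(projected), "seconds")
print("Observed full run:", round(observed_seconds / 3600, 1), "hours")
```

The projection for 11 cores comes out close to the observed runtime, both roughly 11 hours.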

In [20]:
# # Inspect the results of problem extraction
# for i in range(df[:4].shape[0]):
#     print(df.iloc[i]['note'])
#     for problem in df.iloc[i]['problems']:
#         print(problem)
#     print('\n------------------------------------------------')

Assessment and Plan
   85F with h/o ESRD on HD, presenting with hypotension.
   .
   # hypotension - consistent with hypovolemic shock given clinical
   history, HCT changes may not reflect acuity of bleed.  remains
   hypotensive despite 3U PRBC in ED and 1L IVF, in setting of unclear
   volume loss.  given h/o CHF current CVP is more consistent cardiogenic
   shock, also cool extremities, though this could also represent
   peripheral vasoconstriction.  while less likely given lack of fever,
   leukocytosis, sepsis is also a consideration.
   - transduce CVP from cordis -> 19, c/w cardiogenic shock.
   - check SvO2 to confirm decreased CO, if present, consider swan, for
   pressor guided diuresis.
   - check TTE to evaluate [**Last Name (LF) **], [**First Name3 (LF) 300**], valves.
   - start amio gtt for tachycardia.
   - transfuse PRBC and IVF NS if CVP<10.
   - titrate levophed gtt for MAP>65, could consider neo gtt given
   tachycardia, however must consider severe PVD, and risk 

To-do:
- When tidying up to make the final version of code to run: in the AP_to_CUIs function, change the problem list of lists into a set of CUIs and change the write-out code to iterate through the set instead of iterating through the list of lists and then the set

In [16]:
df.shape[0]

141624

In [None]:
df.iloc[2:3]

In [15]:
df.iloc[3]['problems']

{'C0002871',
 'C0018802',
 'C0020538',
 'C0022661',
 'C0023518',
 'C0036974',
 'C0857353',
 'C1145670'}

Patient with a history of St. Louis Encephalitis presents with cough of 3 days duration, fever, and loss of consciousness complicated by head trauma. 

In [48]:
# Create a CSV for import

# Drop rows where no Assessment and Plan could be isolated
print(df.shape[0])
df_drop_empty_notes = df[df['note'] != 'No AP isolated']
print(df_drop_empty_notes.shape[0])

# Restructure the dataframe to have 1 problem per row
df_exploded = df_drop_empty_notes[['storetime', 'subject_id', 'note_id', 'problems']].explode(column='problems')
print(df_exploded.shape[0])

141624
131121
786230
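
The jump in row count comes from `explode`, which emits one row per element of each set in the `problems` column. A note whose set is empty still contributes one row, with a missing value in place of a problem. The plain-Python imitation below (the hypothetical `explode_rows`, with placeholder ids and CUIs) shows that behavior without pandas; unlike pandas, it sorts each set so the output order is deterministic.

```python
def explode_rows(rows, column):
    """Plain-Python imitation of DataFrame.explode for one set-valued column."""
    out = []
    for row in rows:
        values = sorted(row[column]) or [None]  # empty set -> one row with None
        for value in values:
            out.append({**row, column: value})
    return out

# Toy rows; the note ids and CUIs are placeholders
rows = [
    {"note_id": 1, "problems": {"C_A", "C_B"}},
    {"note_id": 2, "problems": set()},
]
exploded = explode_rows(rows, "problems")
for r in exploded:
    print(r)
```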


In [50]:
df_exploded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 786230 entries, 0 to 141623
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   storetime   786230 non-null  object
 1   subject_id  786230 non-null  object
 2   note_id     786230 non-null  object
 3   problems    779770 non-null  object
dtypes: object(4)
memory usage: 30.0+ MB


In [49]:
df_exploded[df_exploded['problems'].isnull()]

Unnamed: 0,storetime,subject_id,note_id,problems
54,2136-11-15 00:23:24,10174,354119,
85,2154-03-30 09:30:57,10206,320916,
90,2154-03-31 09:16:35,10206,321034,
92,2154-03-31 11:35:07,10206,321051,
110,2151-07-18 02:36:43,10302,330984,
...,...,...,...,...
141547,2157-02-23 07:06:06,99957,654757,
141549,2157-02-23 09:54:23,99957,654790,
141550,2157-02-23 11:28:03,99957,654815,
141563,2180-11-29 07:06:35,99973,550825,


In [54]:
df_exploded[:10]

Unnamed: 0,storetime,subject_id,note_id,problems
0,2142-06-02 10:22:01,10134,377333,C0002871
0,2142-06-02 10:22:01,10134,377333,C0020538
0,2142-06-02 10:22:01,10134,377333,C0036974
0,2142-06-02 10:22:01,10134,377333,C0018802
0,2142-06-02 10:22:01,10134,377333,C0022661
0,2142-06-02 10:22:01,10134,377333,C0023518
0,2142-06-02 10:22:01,10134,377333,C1145670
0,2142-06-02 10:22:01,10134,377333,C0857353
1,2142-06-02 12:13:26,10134,377338,C0002871
1,2142-06-02 12:13:26,10134,377338,C0020538


In [51]:
df_exploded.dropna(inplace=True)

In [52]:
df_exploded[df_exploded['problems'].isnull()]

Unnamed: 0,storetime,subject_id,note_id,problems


In [53]:
# Test import of 10 nodes before importing the entire batch
test_import = df_exploded[:10]
test_import.to_csv('test_import.csv', index=False, encoding='utf-8')

In [55]:
# Write CSV to file
df_exploded.to_csv('problems_from_notes.csv', index=False, encoding='utf-8')

# Move CSV to the import folder of the database

In [16]:
# Create a constraint and index the Patients subject_id property
command = '''
CREATE CONSTRAINT Patient_subject_id IF NOT EXISTS
ON (n:Patients)
ASSERT n.subject_id IS UNIQUE
'''
session.run(command)

<neo4j.work.result.Result at 0x7f469d66e8e0>

In [14]:
# Batch import the CSV into the database

start_time = time.time()

command = '''
USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM "file:///problems_from_notes.csv" AS csvLine
MATCH (pt:Patients {subject_id:csvLine.subject_id})
CREATE (prob:Problem {cui:csvLine.problems, datetime_reported:csvLine.storetime})
CREATE (pt)-[:HAD_PROBLEM]->(prob)
WITH prob
MATCH (c:Concept)
WHERE prob.cui = c.cui
SET prob.description = c.term
'''
session.run(command)

print("Total runtime:", time.time() - start_time, "seconds")

Total runtime: 63.29610276222229 seconds


In [15]:
# Let's do a sanity check on the 200 most common problems in the database:
command = '''MATCH (p:Problem)
RETURN p.description, p.cui, count(*) AS number 
ORDER BY number DESC
LIMIT 200'''
data = session.run(command)

pd.set_option("display.max_rows", 200)

top_problems = pd.DataFrame([dict(record) for record in data])
print(top_problems)

                                                                                     p.description  \
0                                                                       rndx infection unspecified   
1                                                                  respiratory failure (diagnosis)   
2                                                                                   physical wound   
3                                                                Acute kidney failure, unspecified   
4                                                                               anemia (diagnosis)   
5                                              (Hypertensive disease) or (hypertension) (disorder)   
6                                                 Body temperature above reference range (finding)   
7                                                                  Altered mental status (finding)   
8                                                                  atrial fibrilla

In [17]:
# Further inspect problems which look suspicious or too general to be useful by looking
# up each CUI using the Metathesaurus Browser
top_problems.iloc[[24, 25, 28, 36, 38, 52, 66, 71, 98, 110, 127, 128, 195]]

Unnamed: 0,p.description,p.cui,number
24,"diagnoses, syndromes, and conditions (diagnosis)",C0012634,5810
25,but,C0233535,5713
28,Wiskott-Aldrich syndrome (diagnosis),C0043194,5161
36,NEURODEGENERATION WITH BRAIN IRON ACCUMULATION 2A,C0270724,4424
38,"MACROTHROMBOCYTOPENIA, NEPHRITIS, AND DEAFNESS",C0340978,4364
52,"(Disorders of fluid, electrolyte and acid-base balance) or (electrolyte disorders) (disorder)",C0267994,3043
66,Stilling,C1410088,2320
71,"Cluster, Symptom",C0039082,2138
98,disease (or disorder); alpine,C0002351,1590
110,Resolution of Pathologic Process,C1514893,1283


### What to do with suspicious problems  

| Problem|Action|Reason |  
| ----|----|---- |  
| Disease|Delete|Too general |  
| Butting|Delete|the word "but" was likely picked up inappropriately |  
| Wiskott-Aldrich Syndrome|Delete|the word "was" was likely picked up inappropriately |  
| Infantile Neuroaxonal Dystrophy|Delete|the word "plan" was likely picked up inappropriately |  
| May-Hegglin anomaly|Delete|the word "plan" was likely picked up inappropriately |  
| Disorder of fluid AND/OR electrolyte|Delete|Too general |  
| Still|Delete|the word "still" was likely picked up inappropriately |  
| Syndrome|Delete|Too general |  
| Altitude Sickness|Change to CUI C0278061 "Abnormal mental state"|The acronym AMS triggered association with altitude sickness, but Boston is at sea level and AMS typically means "Altered Mental Status" in ICU notes |  
| physiologic resolution|Delete|Too general |  
| MICROCEPHALY, EPILEPSY, AND DIABETES SYNDROME|Delete|the word "meds" was likely picked up inappropriately |  
| Ichthyosis linearis circumflexa|Delete|the acronym "NS" was likely picked up inappropriately |  
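
Several of the rows above come from ordinary words or short abbreviations matching UMLS strings. One possible way to catch these before lookup, rather than deleting them afterwards, is a small stoplist plus a minimum length for single tokens. The sketch below is only an illustration: `STOP_TERMS`, `is_plausible_term`, and the `min_length` threshold are hypothetical choices, not part of the notebook.

```python
# Hypothetical stoplist seeded from the false positives in the table above
STOP_TERMS = {"but", "was", "plan", "still", "meds", "ns"}

def is_plausible_term(term, min_length=3):
    """Return False for terms that should not be looked up as problems."""
    normalized = term.strip().lower()
    if normalized in STOP_TERMS:
        return False
    # Single short tokens are usually abbreviations with many meanings
    if ' ' not in normalized and len(normalized) < min_length:
        return False
    return True

candidates = ["but", "Plan", "NS", "hypotension", "AF", "atrial fibrillation"]
print([t for t in candidates if is_plausible_term(t)])
```

The length rule is a trade-off: it also drops legitimate two-letter abbreviations such as "AF", while three-letter ones such as "AMS" still pass and need separate handling.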


In [18]:
# Delete the problems marked above for deletion
delete_list = top_problems.iloc[[24, 25, 28, 36, 38, 52, 66, 71, 110, 127, 128, 195]]['p.cui'].to_list()
print(delete_list)
# Pass the list as a query parameter instead of formatting it into the Cypher text
command = '''
MATCH (p:Problem)
WHERE p.cui IN $delete_list
DETACH DELETE p'''
session.run(command, delete_list=delete_list)

['C0012634', 'C0233535', 'C0043194', 'C0270724', 'C0340978', 'C0267994', 'C1410088', 'C0039082', 'C1514893', 'C3280240', 'C0265962', 'C0221423']


<neo4j.work.result.Result at 0x7f6d1ee5e100>

In [19]:
# Change the "Altitude Sickness" problems to "Abnormal mental state"
command = '''
MATCH (p:Problem {cui:'C0002351'}), (c:Concept_UMLS {cui:'C0278061'})
SET p.cui = c.cui
SET p.description = c.preferred_term
'''
session.run(command)

<neo4j.work.result.Result at 0x7f6d1ee6d6d0>

In [4]:
# Finally, let's connect our new problem nodes to their respective UMLS concept nodes. Linking problems to the notes from which they were extracted is left commented out below.

start_time = time.time()

query = '''
"MATCH (p:Problem)
MATCH (c:Concept)
WHERE p.cui = c.cui
RETURN p, c",
"CREATE (p)-[:INSTANCE_OF]->(c)"'''
command = 'CALL apoc.periodic.iterate('+query+', {batchSize:1000, parallel: true, iterateList:true})'
session.run(command)

# command = '''
# MATCH (n:Noteevents)<-[:HAD]-(pt:Patients)-[:HAD_PROBLEM]->(p:Problem)
# WHERE p.datetime_reported = n.storetime
# MERGE (n)-[:REPORTED]->(p)
# '''
# session.run(command)

print("Total runtime:", time.time() - start_time, "seconds")

Total runtime: 94.5746169090271 seconds


In [3]:
# Create a fulltext index for the description property of the nodes labeled Problem
command = '''CALL db.index.fulltext.createNodeIndex("Pt_Problems", ["Problem"], ["description"])'''
session.run(command)

<neo4j.work.result.Result at 0x7ff3713b7ac0>

To-Do:
- Use Neo4J's builtin datetime functions to find the most recent nodes and to show the nodes in 

Speeding up the query:
- First try: Create nodes for all strings and a Neo4j index on the string property of each node, then set a full-text index and use it for search
- Next try: Consider a Spark dataframe instead of pandas for the str_to_CUI dataframe if the above doesn't work

In [6]:
# Count the number of instances of each problem in a cluster of patients, and report the most frequently encountered problems 
query = '''MATCH (n:Problem) RETURN n.description AS problem, count(n.cui) AS instances ORDER BY instances DESC'''
PL = session.run(query)
ordered_prob_list = pd.DataFrame([dict(record) for record in PL])
ordered_prob_list[:20]

Unnamed: 0,problem,instances
0,Respiratory Failure,24
1,Atrial Fibrillation,22
2,Presenile dementia,21
3,Heart failure,21
4,Metabolic alkalosis,17
5,Hypothyroidism,15
6,Pulmonary Embolism,14
7,"Kidney Failure, Acute",12
8,Communicable Diseases,7
9,Butting,7


In [None]:
session.close()