# Patient Cohort Building with NLP and Knowledge Graphs

In this notebook, we build a Neo4j clinical knowledge graph (KG) from the output of a Spark NLP pipeline that contains pretrained NER (named entity recognition) and RE (relation extraction) models. After creating the knowledge graph, we query it to answer cohort-building questions about patients, drugs, symptoms, and procedures.

To reproduce the visualizations shown below, run the provided queries (just the query text, without the surrounding quotation marks) in Neo4j Browser, which can be opened from the Instances screen in Neo4j via the `>_ Query` tab. The instance password is required to access the graph database.

Note that the color codes of nodes and relationships may differ from those shown here, because they depend on individual browser settings. You may also obtain slightly different outputs, depending on the versions of the pretrained models used in the Spark NLP pipelines.

[Cluster Setup](https://nlp.johnsnowlabs.com/docs/en/licensed_install#install-on-databricks)

In [0]:
%pip install neo4j

In [0]:
%pip install tqdm

## Creation of the Knowledge Graph

### Neo4j Connection

In [0]:
from neo4j import GraphDatabase

import time
from tqdm import tqdm
import pandas as pd
import json

In [0]:
notes_path='/FileStore/HLS/jsl_kg/data/'
patient_df = pd.read_csv(f'/dbfs{notes_path}data.csv', sep=';')
patient_df

Unnamed: 0,subject_id,date,text,gender,dateOfBirth
0,19823,2167-02-25,Admission Date: [**2167-2-16**] Dischar...,F,2099-05-05
1,19823,2167-11-27,Admission Date: [**2167-11-27**] Discha...,F,2099-05-05
2,19823,2170-10-12,Admission Date: [**2170-9-19**] ...,F,2099-05-05
3,19823,2172-06-22,Admission Date: [**2172-6-13**] ...,F,2099-05-05
4,19823,2167-12-07,PATIENT/TEST INFORMATION:\nIndication: Aortic ...,F,2099-05-05
...,...,...,...,...,...
960,70004,2182-06-14,[**2182-6-14**] 10:45 AM\n MR HEAD W & W/O CON...,M,2127-12-06
961,70004,2182-06-25,FDG TUMOR IMAGING (PET-CT) ...,M,2127-12-06
962,70004,2182-08-05,[**2182-8-5**] 11:46 AM\n MR HEAD W & W/O CONT...,M,2127-12-06
963,70004,2182-08-23,FDG TUMOR IMAGING (PET-CT) ...,M,2127-12-06


In [0]:
patient_demographics = patient_df[['subject_id', 'gender', 'dateOfBirth']].drop_duplicates().reset_index(drop=True)
patient_demographics

Unnamed: 0,subject_id,gender,dateOfBirth
0,19823,F,2099-05-05
1,22015,M,2085-10-04
2,17494,F,2193-10-01
3,21153,M,2057-03-03
4,12200,M,2136-06-17
5,23266,M,2122-03-16
6,75632,M,2118-09-04
7,27386,M,2114-08-20
8,28552,F,2044-10-22
9,70004,M,2127-12-06


In [0]:
filename = 'posology_RE_rxnorm_w_drug_resolutions.csv'
folderdir = f'/FileStore/HLS/jsl_kg/data/'
pos_RE_result =  pd.read_csv(f'/dbfs{folderdir+filename}')
pos_RE_result.head()

Unnamed: 0,subject_id,date,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence,rx_text,sent_id,ner_chunk,entity,rxnorm_code,all_codes,resolutions,drug_resolution
0,19823,2167-02-25,DRUG-FORM,DRUG,1391,1399,albuterol,FORM,1414,1423,nebulizers,1.0,Albuterol nebulizers,0,Albuterol nebulizers,,2108226,['2108226' '1154602' '370790' '1154603' '21082...,['albuterol Inhalation Solution' 'albuterol In...,albuterol inhalation solution
1,19823,2167-02-25,DRUG-FORM,DRUG,1405,1412,atrovent,FORM,1414,1423,nebulizers,1.0,Atrovent nebulizers,0,Atrovent nebulizers,,2108451,['2108451' '1173573' '379767' '1173576' '24637...,['ipratropium Inhalation Solution [Atrovent]' ...,ipratropium inhalation solution [atrovent]
2,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,1539,1543,40 mg,DRUG,1551,1555,lasix,1.0,Lasix 40 mg,0,Lasix 40 mg,,200809,['200809' '617319' '103919' '1871459' '201286'...,['furosemide 40 MG Oral Tablet [Lasix]' 'atorv...,furosemide 40 mg oral tablet [lasix]
3,19823,2167-02-25,ROUTE-DRUG,ROUTE,1548,1549,iv,DRUG,1551,1555,lasix,1.0,Lasix,0,Lasix,,202991,['202991' '151963' '2256936' '2256930' '104372...,['Lasix' 'Lasma' 'lasmiditan Oral Tablet' 'las...,lasix
4,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,amaryl,STRENGTH,2343,2348,2.0 mg,1.0,Amaryl 2.0 mg,0,Amaryl 2.0 mg,,901295,['901295' '153591' '1310138' '213799' '2399657...,['sodium fluoride 2.2 MG [Ludent]' 'glimepirid...,sodium fluoride 2.2 mg [ludent]


In [0]:
filename = 'ner_jsl_slim_results.csv'
folderdir = f'/FileStore/HLS/jsl_kg/data/'
ner_DF_result = pd.read_csv(f'/dbfs{folderdir+filename}')
ner_DF_result.head()

Unnamed: 0,subject_id,date,sentence_id,chunk,begin,end,ner_label
0,19823,2167-02-25,0,Shortness of breath,178,196,Symptom
1,19823,2167-02-25,0,cough,199,203,Symptom
2,19823,2167-02-25,1,diabetes type II,345,360,Disease_Syndrome_Disorder
3,19823,2167-02-25,1,congestive heart failure,363,386,Disease_Syndrome_Disorder
4,19823,2167-02-25,1,hypertension,413,424,Disease_Syndrome_Disorder


In [0]:
class Neo4jConnection:
    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response

In [0]:
# Credentials for Neo4j graph database

#uri = dbutils.secrets.get("solution-accelerator-cicd","neo4j-uri") # replace with '<Neo4j Aura instance uri>' or set up this secret in your own workspace
uri = ''

#pwd = dbutils.secrets.get("solution-accelerator-cicd","neo4j-password") # replace with '<Neo4j Aura instance password>' or set up this secret in your own workspace
pwd = ''

#user = dbutils.secrets.get("solution-accelerator-cicd","neo4j-user") # replace with '<Neo4j Aura instance user>' or set up this secret in your own workspace
user='neo4j'

# Establish the connection with Neo4j GDB
conn = Neo4jConnection(uri=uri, user=user , pwd=pwd)
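
Outside Databricks secrets, one possible way to avoid hard-coding the credentials is to read them from environment variables. The sketch below is only an illustration: the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `load_neo4j_credentials` are assumptions, not part of this accelerator.

```python
import os

def load_neo4j_credentials(env=None):
    """Read connection settings from a mapping (defaults to os.environ).

    The variable names used here are hypothetical; adapt them to your setup.
    """
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI", "")
    user = env.get("NEO4J_USER", "neo4j")  # same default user as in the cell above
    pwd = env.get("NEO4J_PASSWORD", "")
    # Fail early with a readable message instead of a driver error later on
    missing = [name for name, value in (("NEO4J_URI", uri), ("NEO4J_PASSWORD", pwd)) if not value]
    if missing:
        raise ValueError(f"Missing Neo4j settings: {', '.join(missing)}")
    return uri, user, pwd

# Example with an explicit mapping (placeholder values, not real credentials)
print(load_neo4j_credentials({"NEO4J_URI": "neo4j+s://example", "NEO4J_PASSWORD": "secret"}))
```

The returned tuple could then be passed to `Neo4jConnection(uri=uri, user=user, pwd=pwd)`.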

**Creating constraints:**

In [0]:
conn.query('CREATE CONSTRAINT patients IF NOT EXISTS FOR (p:Patient) REQUIRE p.name IS UNIQUE;')
conn.query('CREATE CONSTRAINT rx_norm_codes IF NOT EXISTS FOR (rx:RxNorm) REQUIRE rx.code IS UNIQUE;')
conn.query('CREATE CONSTRAINT drugs IF NOT EXISTS FOR (drug:Drug) REQUIRE drug.name IS UNIQUE;')
conn.query('CREATE CONSTRAINT ners IF NOT EXISTS FOR (n:NER) REQUIRE n.name IS UNIQUE;')
conn.query('CREATE CONSTRAINT symptoms IF NOT EXISTS FOR (s:Symptom) REQUIRE s.name IS UNIQUE;')
conn.query('CREATE CONSTRAINT bodyParts IF NOT EXISTS FOR (bp:BodyPart) REQUIRE bp.name IS UNIQUE;')
conn.query('CREATE CONSTRAINT procedures IF NOT EXISTS FOR (p:Procedure) REQUIRE p.name IS UNIQUE;')
conn.query('CREATE CONSTRAINT tests IF NOT EXISTS FOR (t:Test) REQUIRE t.name IS UNIQUE;')
conn.query('CREATE CONSTRAINT dsds IF NOT EXISTS FOR (dsd:DSD) REQUIRE dsd.name IS UNIQUE;')
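
The nine statements above differ only in the constraint name, node variable, label, and property. As a minimal sketch, they could be generated from a small table; `CONSTRAINTS` and `constraint_statements` are hypothetical names introduced here for illustration.

```python
# Constraint name -> (node variable, label, unique property), copied from the statements above
CONSTRAINTS = {
    "patients": ("p", "Patient", "name"),
    "rx_norm_codes": ("rx", "RxNorm", "code"),
    "drugs": ("drug", "Drug", "name"),
    "ners": ("n", "NER", "name"),
    "symptoms": ("s", "Symptom", "name"),
    "bodyParts": ("bp", "BodyPart", "name"),
    "procedures": ("p", "Procedure", "name"),
    "tests": ("t", "Test", "name"),
    "dsds": ("dsd", "DSD", "name"),
}

def constraint_statements(constraints):
    """Build one CREATE CONSTRAINT statement per entry."""
    return [
        f"CREATE CONSTRAINT {name} IF NOT EXISTS FOR ({var}:{label}) REQUIRE {var}.{prop} IS UNIQUE;"
        for name, (var, label, prop) in constraints.items()
    ]

statements = constraint_statements(CONSTRAINTS)
print(statements[0])
# To apply them: for stmt in statements: conn.query(stmt)
```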

**defining helper functions:**

In [0]:
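def update_data(query, rows, batch_size = 10000):
    # Send `rows` (a DataFrame) to Neo4j in slices of `batch_size` records
    total = 0
    batch = 0
    start = time.time()
    result = None
    while batch * batch_size < len(rows):
        res = conn.query(query, parameters={'rows': rows[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
        if not res:
            # conn.query returns None when the query fails, so stop with a clear error
            raise RuntimeError(f"Batch {batch} returned no result; check the 'Query failed' message above.")
        total += res[0]['total']
        batch += 1
        result = {"total":total, "batches":batch, "time":time.time()-start}
        print(result)
    return result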
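
To make the batching in `update_data` easier to follow, here is a small stand-alone sketch of the same slicing logic that needs no Neo4j connection. The generator `iter_batches` and the toy DataFrame are illustrative assumptions only.

```python
import pandas as pd

def iter_batches(rows, batch_size):
    """Yield successive slices of a DataFrame as lists of record dicts."""
    for start in range(0, len(rows), batch_size):
        # Same slice-and-convert step that update_data passes as the $rows parameter
        yield rows[start:start + batch_size].to_dict("records")

demo = pd.DataFrame({"subject_id": [1, 2, 3, 4, 5]})  # toy data, not from the notebook
batches = list(iter_batches(demo, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Each yielded list corresponds to what the Cypher `UNWIND $rows as row` clause iterates over in a single round trip.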

In [0]:
def add_patients(rows, batch_size=10000):
    query = '''
    UNWIND $rows as row
    MERGE(p:Patient{name:row.subject_id}) 
    ON CREATE SET p.gender      = row.gender,
                  p.dateOfBirth = row.dateOfBirth

    WITH p
    MATCH (p)
    RETURN count(*) as total
    '''
    return update_data(query, rows, batch_size)

add_patients(patient_demographics)

In [0]:
def add_drugs_ners(rows, batch_size=1000):
    query = '''
    UNWIND $rows as row
    
    MERGE(p:Patient{name:row.subject_id}) 
    MERGE(rx:RxNorm{code:row.rxnorm_code})
    MERGE (p)-[:RXNORM_CODE{date:date(row.date)}]->(rx)
    
    MERGE (d:Drug{name:row.drug_resolution})
    MERGE (rx)-[:DRUG_GENERIC{date:date(row.date), patient_name:row.subject_id}]->(d)
    
    MERGE(n1:NER{name:row.chunk1}) ON CREATE SET n1.type=row.entity1
    MERGE(n2:NER{name:row.chunk2}) ON CREATE SET n2.type=row.entity2
    
    WITH *
    MATCH (d:Drug{name:row.drug_resolution}), (n1:NER{name:row.chunk1}), (n2:NER{name:row.chunk2})
    CALL apoc.create.relationship (d,row.entity1, {patient_name:row.subject_id, date:date(row.date)}, n1) YIELD rel as relx
    CALL apoc.create.relationship (d,row.entity2, {patient_name:row.subject_id, date:date(row.date)}, n2) YIELD rel as rely
    
    WITH d
    MATCH (d)
    RETURN count(*) as total  
    '''
    return update_data(query, rows, batch_size)
  
add_drugs_ners(pos_RE_result)

**splitting dataframe into multiple dataframes by ner_label and creating nodes and relationships**

In [0]:
# splitting the dataframe into multiple dataframes by ner_label
grouped        = ner_DF_result.groupby('ner_label')
df_symptom     = grouped.get_group('Symptom')
df_dsd         = grouped.get_group('Disease_Syndrome_Disorder')
df_test        = grouped.get_group('Test')
df_bodyPart    = grouped.get_group('Body_Part')
df_procedure   = grouped.get_group('Procedure')
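
Note that `get_group` raises a `KeyError` when a label does not occur in the data. If that can happen with your notes, a more tolerant split might look like the sketch below; `split_by_label` and the toy DataFrame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def split_by_label(df, labels, column="ner_label"):
    """Return {label: sub-DataFrame}; labels absent from df map to an empty frame."""
    return {label: df[df[column] == label] for label in labels}

# Toy data for illustration only
demo = pd.DataFrame({
    "chunk": ["cough", "fever", "INR"],
    "ner_label": ["Symptom", "Symptom", "Test"],
})
parts = split_by_label(demo, ["Symptom", "Test", "Procedure"])
print({label: len(part) for label, part in parts.items()})
```

With the `update_data` helper defined earlier, an empty frame simply results in no batches being sent.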

In [0]:
def add_symptoms(rows, batch_size=500):
    query = '''
    UNWIND $rows as row
    MATCH(p:Patient{name:row.subject_id})
    MERGE(n:Symptom {name:row.chunk})
    MERGE (p)-[:IS_SYMPTOM{date:date(row.date)}]->(n)

    WITH n
    MATCH (n)
    RETURN count(*) as total  
    '''
    return update_data(query, rows, batch_size)
  
add_symptoms(df_symptom)

In [0]:
def add_dsds(rows, batch_size=500):
    query = '''
    UNWIND $rows as row
    MATCH(p:Patient{name:row.subject_id})
    MERGE(n:DSD {name:row.chunk})
    MERGE (p)-[:IS_DSD{date:date(row.date)}]->(n)

    WITH n
    MATCH (n)
    RETURN count(*) as total  
    '''
    return update_data(query, rows, batch_size)
  
add_dsds(df_dsd)

In [0]:
def add_tests(rows, batch_size=500):
    query = '''
    UNWIND $rows as row
    MATCH(p:Patient{name:row.subject_id})
    MERGE(n:Test {name:row.chunk})
    MERGE (p)-[:IS_TEST{date:date(row.date)}]->(n)

    WITH n
    MATCH (n)
    RETURN count(*) as total  
    '''
    return update_data(query, rows, batch_size)
  
add_tests(df_test)

In [0]:
def add_bodyParts(rows, batch_size=500):
    query = '''
    UNWIND $rows as row
    MATCH(p:Patient{name:row.subject_id})
    MERGE(n:BodyPart {name:row.chunk})
    MERGE (p)-[:IS_BODYPART{date:date(row.date)}]->(n)

    WITH n
    MATCH (n)
    RETURN count(*) as total
    '''
    return update_data(query, rows, batch_size)
  
add_bodyParts(df_bodyPart)

In [0]:
def add_procedures(rows, batch_size=500):
    query = '''
    UNWIND $rows as row
    MATCH(p:Patient{name:row.subject_id})
    MERGE(n:Procedure {name:row.chunk})
    MERGE (p)-[:IS_PROCEDURE{date:date(row.date)}]->(n)

    WITH n
    MATCH (n)
    RETURN count(*) as total  
    '''
    return update_data(query, rows, batch_size)
  
add_procedures(df_procedure)
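
The five `add_*` helpers above share the same query and differ only in the node label and the relationship type. As a sketch under that observation, the query text could be produced by one function; `entity_query` is a hypothetical helper, and the label and relationship type are interpolated into the text, so only trusted, hard-coded values should be passed.

```python
def entity_query(label, rel_type):
    """Build the Cypher used by the add_* helpers for a node label and relationship type."""
    # Doubled braces produce literal { } in the f-string; $rows stays a query parameter
    return f'''
    UNWIND $rows as row
    MATCH(p:Patient{{name:row.subject_id}})
    MERGE(n:{label} {{name:row.chunk}})
    MERGE (p)-[:{rel_type}{{date:date(row.date)}}]->(n)

    WITH n
    MATCH (n)
    RETURN count(*) as total
    '''

symptom_query = entity_query("Symptom", "IS_SYMPTOM")
print(symptom_query)
# Equivalent of add_symptoms: update_data(symptom_query, df_symptom, 500)
```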

In [0]:
query_string = '''
CALL db.labels() YIELD label
CALL apoc.cypher.run('MATCH (:`'+label+'`) RETURN count(*) as count',{}) YIELD value
RETURN label, value.count as size
'''
df_nodes = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df_nodes

Unnamed: 0,label,size
0,Patient,10
1,RxNorm,1083
2,Drug,1083
3,NER,915
4,Symptom,2682
5,BodyPart,1221
6,Procedure,675
7,Test,1258
8,DSD,1179


In [0]:
query_string = '''
CALL db.relationshipTypes() YIELD relationshipType as type
CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as count',{}) YIELD value
RETURN type, value.count as size
'''
df_relationships = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df_relationships

Unnamed: 0,type,size
0,RXNORM_CODE,1704
1,DRUG_GENERIC,1704
2,DRUG,14110
3,FORM,1220
4,STRENGTH,4035
5,ROUTE,4695
6,FREQUENCY,2460
7,DOSAGE,1635
8,DURATION,65
9,IS_SYMPTOM,4157


**database schema visualization**

In [0]:
# To get the following visualization run the query in the Neo4j Browser
query_string = '''
CALL db.schema.visualization()
'''


<img src="https://raw.githubusercontent.com/iamvarol/blogposts/main/databricks/images/db_viz.png">

## Queries

**patient 21153's prescriptions:**

In [0]:
patient_name = '21153'
query_part1 = 'MATCH (p:Patient)-[rel_rx]->(rx:RxNorm)-[rel_d]->(d:Drug)-[rel_n]->(n:NER) ' #  
query_part2 = f'WHERE p.name ={patient_name} AND rel_n.date=rel_rx.date AND rel_n.patient_name=p.name ' # 
query_part3 = '''RETURN DISTINCT
                 p.name as patient_name,
                 rel_rx.date as date,
                 d.name as drug_generic_name,  
                 rx.code as rxnorm_code,
                 COALESCE(n.name,'') +  "(" + COALESCE (type(rel_n), "") + ")" as details
                 '''
query_string = query_part1 + query_part2 + query_part3

df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df = df.drop_duplicates(subset= ['patient_name', 'date', 'drug_generic_name'])
df = df.groupby(['patient_name', 'date', 'drug_generic_name', 'rxnorm_code']).agg(lambda x: ' '.join(x)).reset_index()
df

Unnamed: 0,patient_name,date,drug_generic_name,rxnorm_code,details
0,21153,2109-12-17,dex4,607868,iv(ROUTE)
1,21153,2109-12-17,everolimus 2 mg tablet for oral suspension [af...,1310138,2 units(DOSAGE)
2,21153,2109-12-18,cidaflex,1013644,celexa(DRUG)
3,21153,2109-12-18,clorsulon oral solution,1006411,po(ROUTE)
4,21153,2109-12-18,crolom,216281,drip(ROUTE)
...,...,...,...,...,...
147,21153,2110-10-20,cotab a,799044,contrast(DRUG)
148,21153,2110-10-20,dibucaine 1 mg,335534,lidocaine(DRUG)
149,21153,2110-10-20,ofev,1592743,contrast(DRUG)
150,21153,2110-12-02,10 ml sodium chloride 9 mg/ml injection,1807637,normal\n saline(DRUG)
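
The query above interpolates the patient identifier into the Cypher text with an f-string. Because the `query` method of `Neo4jConnection` accepts a `parameters` argument, the same lookup could instead bind the value as a query parameter. The sketch below assumes that `subject_id` values are stored as integers (as the outputs suggest); `prescriptions_query` is a hypothetical helper.

```python
def prescriptions_query(patient_name):
    """Return (query, parameters) for one patient's prescriptions."""
    query = '''
    MATCH (p:Patient)-[rel_rx]->(rx:RxNorm)-[rel_d]->(d:Drug)-[rel_n]->(n:NER)
    WHERE p.name = $patient_name AND rel_n.date = rel_rx.date AND rel_n.patient_name = p.name
    RETURN DISTINCT
        p.name as patient_name,
        rel_rx.date as date,
        d.name as drug_generic_name,
        rx.code as rxnorm_code,
        COALESCE(n.name,'') + "(" + COALESCE(type(rel_n), "") + ")" as details
    '''
    # Assumption: patient names are integers in the graph, so cast before binding
    return query, {"patient_name": int(patient_name)}

query, params = prescriptions_query('21153')
print(params)
# Usage with the connection defined earlier: conn.query(query, parameters=params)
```

Binding the value keeps the query text constant across patients and avoids quoting issues when the identifier is a string.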


**patient 21153's journey in medical records: symptoms, procedures, disease-syndrome-disorders, tests, drugs & rxnorms**

In [0]:
patient_name = '21153'

query_part1 = f'MATCH (p:Patient)-[r1:IS_SYMPTOM]->(s:Symptom) WHERE p.name = {patient_name} '
query_part2 = '''
WITH DISTINCT p.name as patients, r1.date as dates, COLLECT(DISTINCT s.name) as symptoms, COUNT(DISTINCT s.name) as num_symptoms

MATCH (p:Patient)-[r2:IS_PROCEDURE]->(pr:Procedure)
WHERE p.name=patients AND r2.date = dates

WITH DISTINCT p.name as patients, r2.date as dates, COLLECT(DISTINCT pr.name) as procedures, COUNT(DISTINCT pr.name) as num_procedures, symptoms, num_symptoms
MATCH (p:Patient)-[r3:IS_DSD]->(_d:DSD) 
WHERE p.name=patients AND r3.date = dates

WITH DISTINCT p.name as patients, r3.date as dates, symptoms, num_symptoms, procedures, num_procedures,  COLLECT(DISTINCT _d.name) as dsds, COUNT(DISTINCT _d.name) as num_dsds
MATCH (p:Patient)-[r4:IS_TEST]->(_t:Test) 
WHERE p.name=patients AND r4.date = dates

WITH DISTINCT p.name as patients, r4.date as dates, symptoms, num_symptoms, procedures, num_procedures, dsds, num_dsds, COLLECT(_t.name) as tests, COUNT(_t.name) as num_tests
MATCH (p:Patient)-[r5:RXNORM_CODE]->(rx:RxNorm)-[r6]->(_d:Drug)
WHERE p.name=patients AND r5.date = dates
RETURN DISTINCT p.name as patients, r5.date as dates, symptoms, num_symptoms, procedures, num_procedures, dsds, num_dsds, tests, num_tests, COLLECT(DISTINCT toLower(_d.name)) as drugs, COUNT(DISTINCT toLower(_d.name)) as num_drugs, COLLECT(DISTINCT rx.code) as rxnorms, COUNT(DISTINCT rx.code) as num_rxnorm
ORDER BY dates;
'''
query_string = query_part1 + query_part2
df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df

Unnamed: 0,patients,dates,symptoms,num_symptoms,procedures,num_procedures,dsds,num_dsds,tests,num_tests,drugs,num_drugs,rxnorms,num_rxnorm
0,21153,2109-12-17,"[pale, cough reflex, unresponsive, soft - soft...",18,"[PT ALERT, thyraloglossal cyst removal, L [**4...",10,"[ABG'S STABLE, cirrhosis, PAIN, allergies, MON...",8,"[8p-7a, INR, LS CTA, INR REMAINS, POST PLT CT,...",20,"[dex4, everolimus 2 mg tablet for oral suspens...",2,"[607868, 1310138]",2
1,21153,2109-12-18,"[slight abd pain, Weak gag, cough, decrease ag...",19,"[PT ALERT, SM AMT SEROSANG DRG FROM INCISION, ...",7,"[DSD, DIAMOX STARTED AS ORDERED, anxiety, PH D...",9,"[PA, SvO2, 7p-7a, TCO2 DECEASED, magnesium, AB...",15,"[crolom, clorsulon oral solution, insulin argi...",5,"[216281, 1006411, 1740938, 1013644, 607868]",5
2,21153,2109-12-19,"[abd pain, soft to soft distended abd, scant s...",25,"[liver transplant, Anticipate extubation, Tole...",6,"[Assit with pulmonary tiolett, metabolic alkal...",6,"[Liver U/S, LS CTA, ABG's, RSBI-26, IVP, 7p-7a...",9,"[clonidine 0.00833 mg/hr transdermal system, t...",6,"[1360120, 220329, 151619, 1310821, 211372, 139...",6
3,21153,2109-12-20,"[C/O SOB, N/V, SOB]",3,"[MEDIAL JP WITH SM AMT SEROUS DRG, INCISION - ...",2,"[LUNGS CTA BILAT, FAINT BS, ABG'S ACCEPTABLE, ...",5,"[INCLUDING PLT CT, MIN SEROSANG DRG FROM, LS C...",5,[x-seb plus],1,[220956],1
4,21153,2109-12-21,"[C/o abd pain, Pain, tender, serous drainage, ...",18,[bile drainage],1,"[RA, Hypertensive]",2,"[HRR, K+, Hct, Mg++, Plt, FEN, cyclosporin lev...",8,[insulin lispro injectable solution],1,[378841],1
5,21153,2109-12-28,"[JVD, nontender, acute\ndistress, biliary duct...",15,"[liver transplant, extubated, tonsillectomy, e...",7,"[numerous spider nevi, chronic hepatitis\nC, C...",9,"[platelets, red blood cells, ALT, INR, lipase,...",32,"[pantoprazole 40 mg, citalopram, docosanol top...",47,"[330396, 2556, 855204, 351044, 317289, 1441735...",47
6,21153,2110-01-03,[opacification],1,[liver transplant],1,"[ductal dilatation, stricture, LIVER TX,ABD PAIN]",3,"[fluoroscopic\n guidance, LFTs]",2,"[slow trasicor, dibucaine]",2,"[154995, 3339]",2
7,21153,2110-01-14,"[decreased T-tube output, minimal intrahepatic...",11,"[liver transplant, transplant liver]",2,"[LIVER TRANSPLANT, abcess, segment VIII biliar...",6,"[LFTs, color Doppler, TUBE CHOLANGIOGRAM, chol...",6,[lisinopril 20 mg oral tablet [carace]],1,[201382],1
8,21153,2110-01-19,"[numerous splenorenal collaterals, Large subsc...",14,"[PTC drain placement, percutaneous tube placem...",2,"[biloma, ascites, LIVER TRANSPLANT, PTC PLACEM...",4,"[CT 100CC, CT OF THE PELVIS WITH, CT OF, Contr...",5,"[docusate sodium 100 mg [doculase], acalabruti...",4,"[1302275, 1986815, 799044, 1592743]",4
9,21153,2110-01-28,"[hepatic arterial\n blood, diminished arterial...",15,"[liver transplantation, hpatic artery anastamo...",8,"[biloma, renal dysfunction, Large varices, LIV...",5,"[delayed imaging, CAT scan, noncontrast scan, ...",14,"[viravan s, docusate sodium 100 mg [doculase],...",5,"[405398, 1302275, 574983, 105561, 1592743]",5


In [0]:
# To get the following visualization, run the query in the Neo4j Browser

query_string = '''
  MATCH (p:Patient)
  WHERE p.name = 21153

  CALL apoc.path.subgraphAll(p, {
      relationshipFilter: "RXNORM_CODE>|DRUG_GENERIC",
                minLevel: 0,
                maxLevel: 3
                })
   YIELD nodes, relationships
   RETURN nodes, relationships;
   '''

<img src="https://raw.githubusercontent.com/iamvarol/blogposts/main/databricks/images/patients_journey.png">

**which patients used isosorbide:**

In [0]:
# drug based query
drug_generic_name = 'isosorbide' 

query_part1 = 'MATCH (p:Patient)-[rel_rx]->(rx:RxNorm)-[rel_d]->(d:Drug)-[rel_n]->(n:NER) '
query_part2 = f'WHERE d.name CONTAINS "{drug_generic_name}" AND rel_n.date=rel_rx.date AND rel_n.patient_name=p.name '
query_part3 = '''RETURN DISTINCT
                 d.name as drug_generic_name, 
                 p.name as patient_name, 
                 rel_rx.date as date, 
                 rx.code as rxnorm_code, 
                 COALESCE(n.name,'') +  "(" + COALESCE (type(rel_n), "") + ")" as details'''

query_string = query_part1 + query_part2 + query_part3
df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df = df.groupby(['patient_name', 'date', 'rxnorm_code','drug_generic_name']).agg(lambda x : ' '.join(x)).reset_index()
df

Unnamed: 0,patient_name,date,rxnorm_code,drug_generic_name,details
0,12200,2136-06-20,153689,isosorbide dinitrate 30 mg oral spray [isocard],piv(DRUG) 30cc/kg(STRENGTH)
1,19823,2167-12-01,201438,isosorbide mononitrate 60 mg extended release ...,po(ROUTE) x1(FREQUENCY) 60meq(DRUG)
2,19823,2172-06-20,201415,isosorbide dinitrate 5 mg sublingual tablet [i...,5 mg(STRENGTH) medicated(DRUG)
3,21153,2110-09-09,153689,isosorbide dinitrate 30 mg oral spray [isocard],contrast(DRUG) 30 cc(DOSAGE)
4,22015,2161-05-05,153689,isosorbide dinitrate 30 mg oral spray [isocard],kphos(DRUG) 30 meq(STRENGTH)
5,28552,2122-07-17,721832,hydralazine / isosorbide dinitrate oral tablet,"iv(ROUTE) ntg,iv hydralazine(DRUG)"


**patients who were prescribed Lasix between May 2060 and May 2125:**

In [0]:
query_string ='''
MATCH (p:Patient)-[rel_rx]->(rx:RxNorm)-[rel_d]->(d:Drug)-[rel_n:DRUG]->(n:NER)
WHERE d.name IN ['lasix']
      AND rel_n.patient_name=p.name
      AND rel_n.date=rel_rx.date 
      AND rel_rx.date >= date("2060-05-01")
      AND rel_n.date >= date("2060-05-01")
      AND rel_rx.date < date("2125-05-01")
      AND rel_n.date < date("2125-05-01")
RETURN DISTINCT
      d.name as drug_generic_name, 
      p.name as patient_name, 
      rel_rx.date as date
ORDER BY date ASC
'''

df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df

Unnamed: 0,drug_generic_name,patient_name,date
0,lasix,28552,2122-07-17
1,lasix,28552,2122-07-18
2,lasix,28552,2122-07-20
3,lasix,28552,2122-07-21
4,lasix,28552,2122-07-22
5,lasix,28552,2122-07-24
6,lasix,28552,2122-07-29


In [0]:
# To obtain the visualization below run the query in Neo4j Browser:

query_string = '''
  MATCH (p:Patient)-[rel_rx]->(rx:RxNorm)-[rel_d]->(d:Drug)-[rel_n:DRUG]->(n:NER)
  WHERE d.name IN ['lasix']
      AND rel_n.patient_name=p.name
      AND rel_n.date=rel_rx.date 
      AND rel_rx.date >= date("2060-05-01")
      AND rel_n.date >= date("2060-05-01")
      AND rel_rx.date < date("2125-05-01")
      AND rel_n.date < date("2125-05-01")
  RETURN d, rel_rx, rx, rel_d, rel_n, p, n;
  '''

<img src="https://raw.githubusercontent.com/iamvarol/blogposts/main/databricks/images/lasix.png">

**patients using warfarin 2mg and up:**

In [0]:
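# Caveat: the STRENGTH filter below reads only the first character of the strength text,
# so a value such as "10 mg" is read as 1 and excluded.
query_string ='''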
MATCH (p:Patient)-[rel_rx]->(rx:RxNorm)-[rel_d]->(d:Drug)-[rel_n:STRENGTH]->(n:NER)
WHERE toLower(d.name) CONTAINS 'warfarin'
      AND rel_n.patient_name=p.name
      AND rel_n.date=rel_rx.date 
      AND toInteger(left(n.name,1)) >=2
RETURN  DISTINCT
      d.name as drug_generic_name,
      rx.code as rxnorm_code,
      p.name as patient_name,
      n.name as strength,
      rel_rx.date as date
'''
df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df

Unnamed: 0,drug_generic_name,rxnorm_code,patient_name,strength,date
0,warfarin sodium 2.5 mg [coumadin],855313,19823,2.5 mg,2167-11-27
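
The filter `toInteger(left(n.name,1)) >= 2` compares only the first character of the strength text. If multi-digit or decimal strengths matter for your cohort, one option is to return all warfarin strengths and parse them in Python. The helper `first_number` below is a hypothetical sketch of that parsing step, shown on made-up strings.

```python
import re

# One or more digits, optionally followed by a decimal part
_NUMBER = re.compile(r"\d+(?:\.\d+)?")

def first_number(strength):
    """Return the first number in a strength string such as '2.5 mg', or None."""
    match = _NUMBER.search(strength)
    return float(match.group()) if match else None

for text in ["2.5 mg", "10 mg", "5mg", "unknown"]:
    print(text, "->", first_number(text))
```

A function like this could be applied to the `strength` column of the returned DataFrame before comparing against the threshold.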


**dangerous drug combinations:**

In [0]:
query_string ='''
WITH ["ibuprofen", "naproxen", "diclofenac", "indometacin", "ketorolac", "aspirin", "ketoprofen", "dexketoprofen", "meloxicam"] AS nsaids
MATCH (p:Patient)-[r1:RXNORM_CODE]->(rx:RxNorm)-[r2]->(d:Drug)
WHERE any(word IN nsaids WHERE d.name CONTAINS word) 
WITH DISTINCT p.name as patients, COLLECT(DISTINCT d.name) as nsaid_drugs, COUNT(DISTINCT d.name) as num_nsaids
MATCH (p:Patient)-[r1:RXNORM_CODE]->(rx:RxNorm)-[r2]->(d:Drug)
WHERE p.name=patients AND d.name CONTAINS 'warfarin'
RETURN DISTINCT patients, 
                nsaid_drugs, 
                num_nsaids, 
                d.name as warfarin_drug, 
                r1.date as date
'''

df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df

Unnamed: 0,patients,nsaid_drugs,num_nsaids,warfarin_drug,date
0,19823,"[diclofenac potassium 25 mg [zipsor], aspirin ...",9,warfarin,2172-06-22
1,19823,"[diclofenac potassium 25 mg [zipsor], aspirin ...",9,warfarin sodium 1 mg,2172-06-22
2,19823,"[diclofenac potassium 25 mg [zipsor], aspirin ...",9,warfarin sodium 1 mg [coumadin],2170-10-12
3,19823,"[diclofenac potassium 25 mg [zipsor], aspirin ...",9,warfarin sodium 2.5 mg [coumadin],2167-11-27
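
For readers less familiar with Cypher, the predicate `any(word IN nsaids WHERE d.name CONTAINS word)` is a plain substring test against a keyword list. The Python sketch below mirrors it; `is_nsaid` is a hypothetical helper, and the lower-casing is an addition not present in the Cypher query, where `CONTAINS` is case-sensitive.

```python
# Keyword list copied from the query above
NSAIDS = ["ibuprofen", "naproxen", "diclofenac", "indometacin", "ketorolac",
          "aspirin", "ketoprofen", "dexketoprofen", "meloxicam"]

def is_nsaid(drug_name, keywords=NSAIDS):
    """True if any keyword occurs as a substring of the drug name."""
    name = drug_name.lower()  # addition for robustness; the Cypher query does not lower-case
    return any(word in name for word in keywords)

print(is_nsaid("diclofenac potassium 25 mg [zipsor]"))
print(is_nsaid("warfarin sodium 1 mg"))
```

Because the match is purely textual, spelling variants matter: a drug name spelled `indomethacin` would not match the `indometacin` keyword, so it may be worth listing both spellings.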


In [0]:
# To obtain the visualization below, run the following query in Neo4j Browser:

query_string = '''
    WITH ["ibuprofen", "naproxen", "diclofenac", "indometacin", "ketorolac", "aspirin", "ketoprofen", "dexketoprofen", "meloxicam"] AS nsaids
    MATCH (p:Patient)-[r1:RXNORM_CODE]->(rx:RxNorm)-[r2]->(d:Drug)
    WHERE any(word IN nsaids WHERE d.name CONTAINS word) 
    WITH DISTINCT p.name as patients, COLLECT(DISTINCT d.name) as nsaid_drugs, COUNT(DISTINCT d.name) as num_nsaids
    MATCH (p:Patient)-[r1:RXNORM_CODE]->(rx:RxNorm)-[r2]->(d:Drug)
    WHERE p.name=patients AND d.name CONTAINS 'warfarin'
    RETURN p, rx, d, r1, r2;
    '''

<img src="https://raw.githubusercontent.com/iamvarol/blogposts/main/databricks/images/ddc.png">

**patients who underwent a hernia repair, appendectomy, or cholecystectomy:**

In [0]:
query_string = """
MATCH (pcd1:Procedure)-[rel1:IS_PROCEDURE]-(pati1:Patient)
WHERE pcd1.name CONTAINS 'hernia repair' OR pcd1.name CONTAINS 'appendectomy' OR pcd1.name CONTAINS 'cholecystectomy'
RETURN DISTINCT pati1.name as patients, 
                COLLECT(DISTINCT toLower(pcd1.name)) as procedures
"""

df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df

Unnamed: 0,patients,procedures
0,21153,"[hernia repair, appendectomy]"


In [0]:
# patients with chest pain and shortness of breath
query_string = """
MATCH (p1:Patient)-[r1:IS_SYMPTOM]->(s1:Symptom),
(p2:Patient)-[r2:IS_SYMPTOM]->(s2:Symptom)
WHERE s1.name CONTAINS "chest pain" AND s2.name CONTAINS "shortness of breath"
    AND p2.name=p1.name AND r2.date = r1.date
RETURN DISTINCT p1.name as patient, r1.date as date,s1.name as symptom1, s2.name as symptom2
ORDER BY patient
"""
df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df

Unnamed: 0,patient,date,symptom1,symptom2
0,19823,2167-02-25,chest pain,shortness of breath
1,19823,2170-10-12,chest pain,shortness of breath
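
The query above pairs two symptom matches for the same patient on the same date. As a rough cross-check outside the graph, a similar co-occurrence test can be sketched in pandas on a frame shaped like `ner_DF_result`. The function `cooccurring` and the toy rows are assumptions for illustration; unlike the Cypher `CONTAINS`, this sketch lower-cases the chunks before matching.

```python
import pandas as pd

def cooccurring(ner_df, term_a, term_b):
    """Return (subject_id, date) pairs where both terms appear among Symptom chunks."""
    symptoms = ner_df[ner_df["ner_label"] == "Symptom"]
    chunks = symptoms["chunk"].str.lower()
    keys = ["subject_id", "date"]
    a = symptoms.loc[chunks.str.contains(term_a, regex=False, na=False), keys]
    b = symptoms.loc[chunks.str.contains(term_b, regex=False, na=False), keys]
    # Inner join keeps only patient/date pairs present on both sides
    return a.merge(b, on=keys).drop_duplicates().reset_index(drop=True)

# Made-up rows, not taken from the dataset
demo = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "date": ["2167-02-25", "2167-02-25", "2170-10-12", "2167-02-25"],
    "chunk": ["Chest pain", "shortness of breath", "chest pain", "shortness of breath"],
    "ner_label": ["Symptom"] * 4,
})
result = cooccurring(demo, "chest pain", "shortness of breath")
print(result)
```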


**patients with hypertension or diabetes who also report chest pain:**

In [0]:
query_string = """
MATCH (p:Patient)-[r:IS_SYMPTOM]->(s:Symptom),
(p1:Patient)-[r2:IS_DSD]->(_dsd:DSD)
WHERE s.name CONTAINS "chest pain" AND p1.name=p.name AND _dsd.name IN ['hypertension', 'diabetes'] AND r2.date=r.date
RETURN DISTINCT p.name as patient, r.date as date, _dsd.name as dsd, s.name as symptom
"""
df = pd.DataFrame([dict(_) for _ in conn.query(query_string)])
df

Unnamed: 0,patient,date,dsd,symptom
0,27386,2189-07-06,hypertension,chest pain
1,19823,2167-02-25,hypertension,chest pain


In [0]:
# To obtain the visualization below, run the following query in Neo4j Browser:

query_string = '''
  MATCH (p:Patient)-[r:IS_SYMPTOM]->(s:Symptom),
  (p1:Patient)-[r2:IS_DSD]->(_dsd:DSD)
  WHERE s.name CONTAINS "chest pain" AND p1.name=p.name AND _dsd.name IN ['hypertension', 'diabetes'] AND r2.date=r.date
  RETURN DISTINCT p, r, _dsd, s;
  '''

<img src="https://raw.githubusercontent.com/iamvarol/blogposts/main/databricks/images/chest_pain.png">

## License
Copyright [2021] the Notebook Authors. The source in this notebook is provided subject to the [Apache 2.0 License](https://spdx.org/licenses/Apache-2.0.html). All included or referenced third-party libraries are subject to the licenses set forth below.

|Library Name|Library License|Library License URL|Library Source URL|
| :-: | :-:| :-: | :-:|
|Pandas |BSD 3-Clause License| https://github.com/pandas-dev/pandas/blob/master/LICENSE | https://github.com/pandas-dev/pandas|
|Numpy |BSD 3-Clause License| https://github.com/numpy/numpy/blob/main/LICENSE.txt | https://github.com/numpy/numpy|
|Neo4j |Apache License 2.0|https://github.com/neo4j/neo4j/blob/4.4/LICENSE.txt|https://github.com/neo4j/neo4j|
|Apache Spark |Apache License 2.0| https://github.com/apache/spark/blob/master/LICENSE | https://github.com/apache/spark/tree/master/python/pyspark|
|BeautifulSoup|MIT License|https://www.crummy.com/software/BeautifulSoup/#Download|https://www.crummy.com/software/BeautifulSoup/bs4/download/|
|Requests|Apache License 2.0|https://github.com/psf/requests/blob/main/LICENSE|https://github.com/psf/requests|
|Spark NLP Display|Apache License 2.0|https://github.com/JohnSnowLabs/spark-nlp-display/blob/main/LICENSE|https://github.com/JohnSnowLabs/spark-nlp-display|
|Spark NLP |Apache License 2.0| https://github.com/JohnSnowLabs/spark-nlp/blob/master/LICENSE | https://github.com/JohnSnowLabs/spark-nlp|
|Spark NLP for Healthcare|[Proprietary license - John Snow Labs Inc.](https://www.johnsnowlabs.com/spark-nlp-health/) |NA|NA|




|Author|
|-|
|Databricks Inc.|
|John Snow Labs Inc.|

## Disclaimers
Databricks Inc. (“Databricks”) does not dispense medical, diagnosis, or treatment advice. This Solution Accelerator (“tool”) is for informational purposes only and may not be used as a substitute for professional medical advice, treatment, or diagnosis. This tool may not be used within Databricks to process Protected Health Information (“PHI”) as defined in the Health Insurance Portability and Accountability Act of 1996, unless you have executed with Databricks a contract that allows for processing PHI, an accompanying Business Associate Agreement (BAA), and are running this notebook within a HIPAA Account.  Please note that if you run this notebook within Azure Databricks, your contract with Microsoft applies.