# Patient Cohort Building with NLP and Knowledge Graphs

In this notebook, we build a clinical Knowledge Graph (KG) using Spark NLP pretrained models and Neo4j. Pretrained NER and relation extraction models extract the entities and relations that populate the graph. After creating the knowledge graph, we query it to extract insights.
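
As a rough illustration of how an extracted relation row could end up in Neo4j, the sketch below turns one row into a parameterized Cypher `MERGE` statement. The node labels (`Patient`, `Drug`), the relationship type (`MENTIONS_DRUG`), and the helper name are hypothetical choices for this example, not the schema used by this notebook. The row fields match the relation extraction output produced later.

```python
def relation_to_cypher(row: dict):
    """Build a Cypher query and its parameters for one relation row.

    Labels and relationship type are illustrative assumptions.
    """
    query = (
        "MERGE (p:Patient {id: $subject_id}) "
        "MERGE (d:Drug {name: $drug}) "
        "MERGE (p)-[:MENTIONS_DRUG {date: $date}]->(d)"
    )
    # The drug can sit on either side of the relation (DRUG-FORM vs. STRENGTH-DRUG).
    drug = row["chunk1"] if row["entity1"] == "DRUG" else row["chunk2"]
    params = {
        "subject_id": row["subject_id"],
        "drug": drug.lower(),
        "date": row["date"],
    }
    return query, params


row = {
    "subject_id": 19823, "date": "2167-02-25",
    "entity1": "STRENGTH", "chunk1": "40 mg",
    "entity2": "DRUG", "chunk2": "Lasix",
}
query, params = relation_to_cypher(row)
print(query)
print(params)
```

Keeping values in `params` rather than formatting them into the query string avoids quoting problems with drug names that contain special characters.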

[Cluster Setup](https://nlp.johnsnowlabs.com/docs/en/licensed_install#install-on-databricks)

**Initial Configurations**

In [0]:
import json
import os

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel,Pipeline
from pyspark.sql import functions as F
from pyspark.sql.types import *

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")
pd.set_option("display.max_colwidth",100)

print('sparknlp.version : ',sparknlp.version())
print('sparknlp_jsl.version : ',sparknlp_jsl.version())

spark

In [0]:
spark._jvm.com.johnsnowlabs.util.start.registerListenerAndStartRefresh()

## Download Medical Dataset

In this notebook, we use medical records in CSV format.

In [0]:
notes_path='/FileStore/HLS/kg/data/'
delta_path='/FileStore/HLS/kg/delta/jsl/'

dbutils.fs.mkdirs(notes_path)
os.environ['notes_path']=f'/dbfs{notes_path}'
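
On Databricks, `dbutils.fs` and Spark address DBFS paths such as `/FileStore/...` directly, while shell commands and pandas go through the local `/dbfs` mount, which is why the cell above exports `/dbfs{notes_path}` for the shell. A minimal sketch of a helper that performs this mapping (the function name `to_local_path` is an assumption for illustration, not part of any Databricks API):

```python
import posixpath


def to_local_path(dbfs_path: str, *parts: str) -> str:
    """Map a DBFS path like '/FileStore/x' to its '/dbfs/FileStore/x' mount path."""
    joined = posixpath.join(dbfs_path, *parts)
    # normpath collapses duplicate slashes left over from trailing separators.
    return posixpath.normpath("/dbfs/" + joined.lstrip("/"))


notes_path = "/FileStore/HLS/kg/data/"
print(to_local_path(notes_path, "data.csv"))
```

Joining with `posixpath.join` instead of string concatenation also guards against a missing or doubled `/` between a directory and a file name.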

In [0]:
%sh
cd $notes_path
wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/databricks/python/healthcare_case_studies/data/data.csv

In [0]:
dbutils.fs.ls(f'{notes_path}/')

## Read Data and Write to Bronze Delta Layer

There are 965 clinical records in the dataset. We read the data and write the records to the bronze Delta layer.

In [0]:
# pandas reads through the local /dbfs mount, not the bare DBFS path
df = pd.read_csv(f'/dbfs{notes_path}data.csv', sep=';')
df

Unnamed: 0,subject_id,date,text,gender,dateOfBirth
0,19823,2167-02-25,Admission Date: [**2167-2-16**] Discharge Date: [**2167-2-24**]\n\nDate of Birth: [**2...,F,2099-05-05
1,19823,2167-11-27,Admission Date: [**2167-11-27**] Discharge Date: [**2167-12-9**]\n\nDate of Birth: [**...,F,2099-05-05
2,19823,2170-10-12,Admission Date: [**2170-9-19**] Discharge Date: [**2170-10-12**]\n\nDate of Birt...,F,2099-05-05
3,19823,2172-06-22,Admission Date: [**2172-6-13**] Discharge Date: [**2172-6-22**]\n\nDate of Birth...,F,2099-05-05
4,19823,2167-12-07,PATIENT/TEST INFORMATION:\nIndication: Aortic valve disease. Shortness of breath.\nHeight: (in) ...,F,2099-05-05
...,...,...,...,...,...
960,70004,2182-06-14,[**2182-6-14**] 10:45 AM\n MR HEAD W & W/O CONTRAST Clip ...,M,2127-12-06
961,70004,2182-06-25,FDG TUMOR IMAGING (PET-CT) Clip # [**Clip Number (Radiology...,M,2127-12-06
962,70004,2182-08-05,[**2182-8-5**] 11:46 AM\n MR HEAD W & W/O CONTRAST Clip #...,M,2127-12-06
963,70004,2182-08-23,FDG TUMOR IMAGING (PET-CT) Clip # [**Clip Number (Radiology...,M,2127-12-06


In [0]:
sparkDF=spark.createDataFrame(df) 
sparkDF.printSchema()
sparkDF.show()

In [0]:
sparkDF.write.format('delta').mode('overwrite').save(f'{delta_path}/bronze/dataset')
display(dbutils.fs.ls(f'{delta_path}/bronze/dataset'))

## Posology RE Pipeline

### Posology Relation Extraction

The pretrained posology relation extraction model supports the following relations:

DRUG-DOSAGE
DRUG-FREQUENCY
DRUG-ADE (Adverse Drug Events)
DRUG-FORM
DRUG-ROUTE
DRUG-DURATION
DRUG-REASON
DRUG-STRENGTH

The model has been validated against the posology dataset described in (Magge, Scotch, & Gonzalez-Hernandez, 2018).

| Relation | Recall | Precision | F1 | F1 (Magge, Scotch, & Gonzalez-Hernandez, 2018) |
| --- | --- | --- | --- | --- |
| DRUG-ADE | 0.66 | 1.00 | **0.80** | 0.76 |
| DRUG-DOSAGE | 0.89 | 1.00 | **0.94** | 0.91 |
| DRUG-DURATION | 0.75 | 1.00 | **0.85** | 0.92 |
| DRUG-FORM | 0.88 | 1.00 | **0.94** | 0.95* |
| DRUG-FREQUENCY | 0.79 | 1.00 | **0.88** | 0.90 |
| DRUG-REASON | 0.60 | 1.00 | **0.75** | 0.70 |
| DRUG-ROUTE | 0.79 | 1.00 | **0.88** | 0.95* |
| DRUG-STRENGTH | 0.95 | 1.00 | **0.98** | 0.97 |


*Magge, Scotch, Gonzalez-Hernandez (2018) collapsed DRUG-FORM and DRUG-ROUTE into a single relation.
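
Assuming the F1 column is the usual harmonic mean of precision and recall (the source does not state the formula), the values can be checked with a few lines of Python. The inputs below are the rounded figures from the table, so a recomputed F1 may differ from the reported one by 0.01 for some rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
table = {
    "DRUG-ADE": (1.00, 0.66),
    "DRUG-DOSAGE": (1.00, 0.89),
    "DRUG-REASON": (1.00, 0.60),
}

for relation, (precision, recall) in table.items():
    print(f"{relation}: F1 = {f1_score(precision, recall):.2f}")
```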

In [0]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("documents")

sentencer = SentenceDetector()\
    .setInputCols(["documents"])\
    .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

posology_ner = MedicalNerModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ners")   

posology_ner_converter = NerConverterInternal() \
    .setInputCols(["sentences", "tokens", "ners"]) \
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

reModel = RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("posology_relations")\
    .setMaxSyntacticDistance(4)

pipeline = Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    posology_ner,
    posology_ner_converter,
    dependency_parser,
    reModel
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

In [0]:
results = model.transform(sparkDF)
results.printSchema()

In [0]:
results.select('posology_relations.metadata').show(5)

In [0]:
results.select('subject_id','date', F.explode(F.arrays_zip('posology_relations.result', 'posology_relations.metadata')).alias("cols")).show()

In [0]:
result_df = results.select('subject_id','date',F.explode(F.arrays_zip(results.posology_relations.result, results.posology_relations.metadata)).alias("cols")) \
                   .select('subject_id','date',F.expr("cols['0']").alias("relation"),
                                               F.expr("cols['1']['entity1']").alias("entity1"),
                                               F.expr("cols['1']['entity1_begin']").alias("entity1_begin"),
                                               F.expr("cols['1']['entity1_end']").alias("entity1_end"),
                                               F.expr("cols['1']['chunk1']").alias("chunk1"),
                                               F.expr("cols['1']['entity2']").alias("entity2"),
                                               F.expr("cols['1']['entity2_begin']").alias("entity2_begin"),
                                               F.expr("cols['1']['entity2_end']").alias("entity2_end"),
                                               F.expr("cols['1']['chunk2']").alias("chunk2"),
                                               F.expr("cols['1']['confidence']").alias("confidence"))

In [0]:
result_df.show()

In [0]:
pd_result = result_df.toPandas()
pd_result

Unnamed: 0,subject_id,date,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence
0,19823,2167-02-25,DRUG-FORM,DRUG,1391,1399,Albuterol,FORM,1414,1423,nebulizers,1.0
1,19823,2167-02-25,DRUG-FORM,DRUG,1405,1412,Atrovent,FORM,1414,1423,nebulizers,1.0
2,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,1539,1543,40 mg,DRUG,1551,1555,Lasix,1.0
3,19823,2167-02-25,ROUTE-DRUG,ROUTE,1548,1549,IV,DRUG,1551,1555,Lasix,1.0
4,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,Amaryl,STRENGTH,2343,2348,2.0 mg,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2777,70004,2182-08-05,ROUTE-DRUG,ROUTE,545,546,IV,DRUG,548,555,contrast,1.0
2778,70004,2182-08-05,DOSAGE-DRUG,DOSAGE,942,946,20 cc,DRUG,951,959,Magnevist,1.0
2779,70004,2182-08-05,DRUG-ROUTE,DRUG,951,959,Magnevist,ROUTE,961,971,intravenous,1.0
2780,70004,2182-08-16,ROUTE-DRUG,ROUTE,475,476,IV,DRUG,478,485,CONTRAST,1.0


In [0]:
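outname = 'posology_re_results.csv'
outdir = f'{delta_path}silver/dataset'
dbutils.fs.mkdirs(outdir)
# pandas writes through the /dbfs mount; os.path.join keeps the file inside outdir
pd_result.to_csv(os.path.join(f'/dbfs{outdir}', outname), index=False, encoding="utf-8")
display(dbutils.fs.ls(outdir))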

In [0]:
temp = pd.read_csv(f'/dbfs{delta_path}silver/dataset/posology_re_results.csv')
temp

Unnamed: 0,subject_id,date,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence
0,19823,2167-02-25,DRUG-FORM,DRUG,1391,1399,Albuterol,FORM,1414,1423,nebulizers,1.0
1,19823,2167-02-25,DRUG-FORM,DRUG,1405,1412,Atrovent,FORM,1414,1423,nebulizers,1.0
2,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,1539,1543,40 mg,DRUG,1551,1555,Lasix,1.0
3,19823,2167-02-25,ROUTE-DRUG,ROUTE,1548,1549,IV,DRUG,1551,1555,Lasix,1.0
4,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,Amaryl,STRENGTH,2343,2348,2.0 mg,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2777,70004,2182-08-05,ROUTE-DRUG,ROUTE,545,546,IV,DRUG,548,555,contrast,1.0
2778,70004,2182-08-05,DOSAGE-DRUG,DOSAGE,942,946,20 cc,DRUG,951,959,Magnevist,1.0
2779,70004,2182-08-05,DRUG-ROUTE,DRUG,951,959,Magnevist,ROUTE,961,971,intravenous,1.0
2780,70004,2182-08-16,ROUTE-DRUG,ROUTE,475,476,IV,DRUG,478,485,CONTRAST,1.0


## RxNorm Code Extraction from RE Results

In [0]:
import os
import pandas as pd

outname = 'posology_re_results.csv'
outdir = f'{delta_path}silver/dataset'
# read_csv has no `index` argument; read through the /dbfs mount
pd_RE = pd.read_csv(os.path.join(f'/dbfs{outdir}', outname), encoding="utf-8")
pd_RE

Unnamed: 0,subject_id,date,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence
0,19823,2167-02-25,DRUG-FORM,DRUG,1391,1399,Albuterol,FORM,1414,1423,nebulizers,1.0
1,19823,2167-02-25,DRUG-FORM,DRUG,1405,1412,Atrovent,FORM,1414,1423,nebulizers,1.0
2,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,1539,1543,40 mg,DRUG,1551,1555,Lasix,1.0
3,19823,2167-02-25,ROUTE-DRUG,ROUTE,1548,1549,IV,DRUG,1551,1555,Lasix,1.0
4,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,Amaryl,STRENGTH,2343,2348,2.0 mg,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2777,70004,2182-08-05,ROUTE-DRUG,ROUTE,545,546,IV,DRUG,548,555,contrast,1.0
2778,70004,2182-08-05,DOSAGE-DRUG,DOSAGE,942,946,20 cc,DRUG,951,959,Magnevist,1.0
2779,70004,2182-08-05,DRUG-ROUTE,DRUG,951,959,Magnevist,ROUTE,961,971,intravenous,1.0
2780,70004,2182-08-16,ROUTE-DRUG,ROUTE,475,476,IV,DRUG,478,485,CONTRAST,1.0


In [0]:
sp_RE = spark.createDataFrame(pd_RE)
sp_RE.show(20)

In [0]:
sp_RE.rdd.getNumPartitions()

In [0]:
# drug + strength or form
from pyspark.sql.functions import when, col

sp_RE_results = sp_RE.withColumn('rx_text',
  when( (F.col('entity1')=='DRUG') & ((F.col('entity2')=='FORM') | (F.col('entity2')=='STRENGTH') | (F.col('entity2')=='DOSAGE') ), F.concat(F.col('chunk1'),F.lit(' '), F.col('chunk2')))
 .when( ((F.col('entity1')=='FORM') | (F.col('entity1')=='STRENGTH') | (F.col('entity1')=='DOSAGE') ) & (F.col('entity2')=='DRUG'), F.concat(F.col('chunk2'),F.lit(' '), F.col('chunk1')))
 .when( (F.col('entity1')=='DRUG') & ((F.col('entity2')!='FORM') & (F.col('entity2')!='STRENGTH') & (F.col('entity2')!='DOSAGE') ), F.col('chunk1'))
 .when( (F.col('entity2')=='DRUG') & ((F.col('entity1')!='FORM') & (F.col('entity1')!='STRENGTH') & (F.col('entity1')!='DOSAGE') ), F.col('chunk2'))
                   .otherwise(F.lit(' '))
                   )

sp_RE_results.show(20,70)
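
The `when` chain above can be read as a single rule: put the drug first, and append the other chunk only when it is a form, strength, or dosage. A minimal plain-Python sketch of the same rule for one row (the function name `build_rx_text` is chosen for this example, and unlike Spark it does not handle null values):

```python
ATTRIBUTE_LABELS = {"FORM", "STRENGTH", "DOSAGE"}


def build_rx_text(entity1: str, chunk1: str, entity2: str, chunk2: str) -> str:
    """Return the text sent to the RxNorm resolver for one relation row."""
    if entity1 == "DRUG":
        # Drug on the left: append the attribute only if it helps identify the product.
        return f"{chunk1} {chunk2}" if entity2 in ATTRIBUTE_LABELS else chunk1
    if entity2 == "DRUG":
        # Drug on the right: reorder so the drug name still comes first.
        return f"{chunk2} {chunk1}" if entity1 in ATTRIBUTE_LABELS else chunk2
    # No drug in the relation: same placeholder as the Spark expression.
    return " "


print(build_rx_text("DRUG", "Albuterol", "FORM", "nebulizers"))
print(build_rx_text("STRENGTH", "40 mg", "DRUG", "Lasix"))
print(build_rx_text("ROUTE", "IV", "DRUG", "Lasix"))
```

Routes, frequencies, and durations are dropped because they do not help map the mention to an RxNorm concept.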

In [0]:
documentAssembler = DocumentAssembler()\
      .setInputCol("rx_text")\
      .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sentence_embeddings")
    
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

rxnorm_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        rxnorm_resolver])

In [0]:
rxnorm_results = rxnorm_pipelineModel.transform(sp_RE_results)
rxnorm_results.printSchema(), rxnorm_results.rdd.getNumPartitions()

In [0]:
sp_rxnorm_result = rxnorm_results.select('subject_id','date', 'relation', 'entity1', 'entity1_begin','entity1_end',  'chunk1', 'entity2', 'entity2_begin', 'entity2_end', 
                                         'chunk2', 'confidence', 'rx_text', 
                                         F.explode(F.arrays_zip(rxnorm_results.ner_chunk.result, 
                                                                rxnorm_results.ner_chunk.metadata, 
                                                                rxnorm_results.rxnorm_code.result, 
                                                                rxnorm_results.rxnorm_code.metadata)).alias("cols")) \
                                     .select('subject_id','date', 'relation', 'entity1', 'entity1_begin','entity1_end',  'chunk1', 'entity2', 'entity2_begin', 'entity2_end',
                                             'chunk2', 'confidence', 'rx_text',
                                             F.expr("cols['1']['sentence']").alias("sent_id"),
                                             F.expr("cols['0']").alias("ner_chunk"),
                                             F.expr("cols['1']['entity']").alias("entity"), 
                                             F.expr("cols['2']").alias('rxnorm_code'),
                                             F.expr("cols['3']['all_k_results']").alias("all_codes"),
                                             F.expr("cols['3']['all_k_resolutions']").alias("resolutions"))

In [0]:
sp_rxnorm_result.show()

In [0]:
sp_rxnorm_result = sp_rxnorm_result.withColumn('all_codes', F.split(F.col('all_codes'), ':::'))\
                                    .withColumn('resolutions', F.split(F.col('resolutions'), ':::'))

sp_rxnorm_result.show()
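
The resolver metadata stores its candidate codes and their descriptions as single strings joined by `:::`, which is why both columns are split above. The sketch below shows the same split in plain Python and pairs each code with its description. It assumes the two lists are aligned by position, and the helper name is hypothetical.

```python
def split_candidates(all_codes: str, all_resolutions: str, sep: str = ":::"):
    """Split the ':::'-joined strings and pair each code with its description."""
    codes = all_codes.split(sep)
    names = all_resolutions.split(sep)
    return list(zip(codes, names))


pairs = split_candidates(
    "2108226:::1154602:::370790",
    "albuterol Inhalation Solution:::albuterol Inhalant Product:::albuterol Injectable Solution",
)
for code, name in pairs:
    print(code, name)
```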

In [0]:
pd_rxnorm_result = sp_rxnorm_result.toPandas()
pd_rxnorm_result

Unnamed: 0,subject_id,date,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence,rx_text,sent_id,ner_chunk,entity,rxnorm_code,all_codes,resolutions
0,19823,2167-02-25,DRUG-FORM,DRUG,1391,1399,Albuterol,FORM,1414,1423,nebulizers,1.0,Albuterol nebulizers,0,Albuterol nebulizers,,2108226,"[2108226, 1154602, 370790, 1154603, 2108233, 2108255, 2108276, 745678, 1163444, 2108246, 2108507...","[albuterol Inhalation Solution, albuterol Inhalant Product, albuterol Injectable Solution, albut..."
1,19823,2167-02-25,DRUG-FORM,DRUG,1405,1412,Atrovent,FORM,1414,1423,nebulizers,1.0,Atrovent nebulizers,0,Atrovent nebulizers,,2108451,"[2108451, 1173573, 379767, 1173576, 2463732, 1945043, 1172634, 1171309, 363357, 1184866, 1170108...","[ipratropium Inhalation Solution [Atrovent], Atrovent Inhalant Product, Atrovent Autohaler, Atro..."
2,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,1539,1543,40 mg,DRUG,1551,1555,Lasix,1.0,Lasix 40 mg,0,Lasix 40 mg,,200809,"[200809, 617319, 103919, 1871459, 201286, 2556796, 1927858, 1648194, 977916, 352320, 208458, 173...","[furosemide 40 MG Oral Tablet [Lasix], atorvastatin 40 MG [Lipitor], fluvastatin 40 MG Oral Caps..."
3,19823,2167-02-25,ROUTE-DRUG,ROUTE,1548,1549,IV,DRUG,1551,1555,Lasix,1.0,Lasix,0,Lasix,,202991,"[202991, 151963, 2256936, 2256930, 1043720, 224946, 217961, 203783, 261550, 1013021, 606658, 218...","[Lasix, Lasma, lasmiditan Oral Tablet, lasmiditan, LidoWorx, Lidex, Laniroif, Lanoxicaps, Lanabi..."
4,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,Amaryl,STRENGTH,2343,2348,2.0 mg,1.0,Amaryl 2.0 mg,0,Amaryl 2.0 mg,,901295,"[901295, 153591, 1310138, 213799, 2399657, 1036818, 998190, 1439900, 905270, 540140, 202295, 104...","[sodium fluoride 2.2 MG [Ludent], glimepiride 2 MG Oral Tablet [Amaryl], everolimus 2 MG Tablet ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2777,70004,2182-08-05,ROUTE-DRUG,ROUTE,545,546,IV,DRUG,548,555,contrast,1.0,contrast,0,contrast,,1592743,"[1592743, 202939, 23202, 66977, 2262255, 543455, 705766, 1436150, 744843, 216795, 1946584, 65874...","[Ofev, Dixarit, Dilor, Vascor, Scenesse, Durad, Appearex, Visco-Gel, Isentress, Duratest, Xhance..."
2778,70004,2182-08-05,DOSAGE-DRUG,DOSAGE,942,946,20 cc,DRUG,951,959,Magnevist,1.0,Magnevist 20 cc,0,Magnevist 20 cc,,208456,"[208456, 152893, 2286257, 617317, 664142, 1119558, 596927, 429343, 440810, 571777, 351387, 79386...","[tacrine 20 MG Oral Capsule [Cognex], sertindole 20 MG Oral Tablet [Serdolect], dexamethasone 20..."
2779,70004,2182-08-05,DRUG-ROUTE,DRUG,951,959,Magnevist,ROUTE,961,971,intravenous,1.0,Magnevist,0,Magnevist,,196214,"[196214, 2475179, 406156, 991881, 6574, 218204, 218250, 218167, 797858, 1043619, 152000, 218245,...","[Magnesiocard, magnesite, MagneBind, Maracyn Plus, magnesium, Maoson, Maxaquin, Magagel Plus, Ma..."
2780,70004,2182-08-16,ROUTE-DRUG,ROUTE,475,476,IV,DRUG,478,485,CONTRAST,1.0,CONTRAST,0,CONTRAST,,799044,"[799044, 153381, 385716, 216281, 668395, 1013644, 216253, 284702, 1188463, 2264346, 323984, 2158...","[Cotab A, Cozaar-Comp, Cesamet, Crolom, Certuss, Cidaflex, Cosopt, Colocort, Citravet, belladonn..."


In [0]:
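outname = 'posology_RE_rxnorm_results.csv'
outdir = f'{delta_path}silver/dataset'
dbutils.fs.mkdirs(outdir)
# pandas writes through the /dbfs mount; os.path.join keeps the file inside outdir
pd_rxnorm_result.to_csv(os.path.join(f'/dbfs{outdir}', outname), index=False, encoding="utf-8")
display(dbutils.fs.ls(outdir))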

### Split Resolutions into Drug Resolutions and Write Results to the Golden Delta Layer

In [0]:
outname = 'posology_RE_rxnorm_results.csv'
outdir = f'{delta_path}silver/dataset'
df = pd.read_csv(os.path.join(f'/dbfs{outdir}', outname))
df

Unnamed: 0,subject_id,date,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence,rx_text,sent_id,ner_chunk,entity,rxnorm_code,all_codes,resolutions
0,19823,2167-02-25,DRUG-FORM,DRUG,1391,1399,Albuterol,FORM,1414,1423,nebulizers,1.0,Albuterol nebulizers,0,Albuterol nebulizers,,2108226,['2108226' '1154602' '370790' '1154603' '2108233' '2108255' '2108276'\n '745678' '1163444' '2108...,['albuterol Inhalation Solution' 'albuterol Inhalant Product'\n 'albuterol Injectable Solution' ...
1,19823,2167-02-25,DRUG-FORM,DRUG,1405,1412,Atrovent,FORM,1414,1423,nebulizers,1.0,Atrovent nebulizers,0,Atrovent nebulizers,,2108451,['2108451' '1173573' '379767' '1173576' '2463732' '1945043' '1172634'\n '1171309' '363357' '1184...,['ipratropium Inhalation Solution [Atrovent]' 'Atrovent Inhalant Product'\n 'Atrovent Autohaler'...
2,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,1539,1543,40 mg,DRUG,1551,1555,Lasix,1.0,Lasix 40 mg,0,Lasix 40 mg,,200809,['200809' '617319' '103919' '1871459' '201286' '2556796' '1927858'\n '1648194' '977916' '352320'...,['furosemide 40 MG Oral Tablet [Lasix]' 'atorvastatin 40 MG [Lipitor]'\n 'fluvastatin 40 MG Oral...
3,19823,2167-02-25,ROUTE-DRUG,ROUTE,1548,1549,IV,DRUG,1551,1555,Lasix,1.0,Lasix,0,Lasix,,202991,['202991' '151963' '2256936' '2256930' '1043720' '224946' '217961'\n '203783' '261550' '1013021'...,['Lasix' 'Lasma' 'lasmiditan Oral Tablet' 'lasmiditan' 'LidoWorx' 'Lidex'\n 'Laniroif' 'Lanoxica...
4,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,Amaryl,STRENGTH,2343,2348,2.0 mg,1.0,Amaryl 2.0 mg,0,Amaryl 2.0 mg,,901295,['901295' '153591' '1310138' '213799' '2399657' '1036818' '998190'\n '1439900' '905270' '540140'...,['sodium fluoride 2.2 MG [Ludent]' 'glimepiride 2 MG Oral Tablet [Amaryl]'\n 'everolimus 2 MG Ta...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2777,70004,2182-08-05,ROUTE-DRUG,ROUTE,545,546,IV,DRUG,548,555,contrast,1.0,contrast,0,contrast,,1592743,['1592743' '202939' '23202' '66977' '2262255' '543455' '705766' '1436150'\n '744843' '216795' '1...,['Ofev' 'Dixarit' 'Dilor' 'Vascor' 'Scenesse' 'Durad' 'Appearex'\n 'Visco-Gel' 'Isentress' 'Dura...
2778,70004,2182-08-05,DOSAGE-DRUG,DOSAGE,942,946,20 cc,DRUG,951,959,Magnevist,1.0,Magnevist 20 cc,0,Magnevist 20 cc,,208456,['208456' '152893' '2286257' '617317' '664142' '1119558' '596927' '429343'\n '440810' '571777' '...,['tacrine 20 MG Oral Capsule [Cognex]'\n 'sertindole 20 MG Oral Tablet [Serdolect]' 'dexamethaso...
2779,70004,2182-08-05,DRUG-ROUTE,DRUG,951,959,Magnevist,ROUTE,961,971,intravenous,1.0,Magnevist,0,Magnevist,,196214,['196214' '2475179' '406156' '991881' '6574' '218204' '218250' '218167'\n '797858' '1043619' '15...,['Magnesiocard' 'magnesite' 'MagneBind' 'Maracyn Plus' 'magnesium'\n 'Maoson' 'Maxaquin' 'Magage...
2780,70004,2182-08-16,ROUTE-DRUG,ROUTE,475,476,IV,DRUG,478,485,CONTRAST,1.0,CONTRAST,0,CONTRAST,,799044,['799044' '153381' '385716' '216281' '668395' '1013644' '216253' '284702'\n '1188463' '2264346' ...,['Cotab A' 'Cozaar-Comp' 'Cesamet' 'Crolom' 'Certuss' 'Cidaflex' 'Cosopt'\n 'Colocort' 'Citravet...


In [0]:
df['res']=df['resolutions'].str.split(' ').str[0]
df.res.head()

In [0]:
df['resolution'] = [val[2:] for val in df['res']]
df['resolution'].head()

In [0]:
df['drug_resolution'] = df['resolution'].str.split().str.get(0)
df['drug_resolution'] = df['drug_resolution'].replace({',':''}, regex=True)
df['drug_resolution'] = df['drug_resolution'].replace({"'":""}, regex=True)
df.head(20)

Unnamed: 0,subject_id,date,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence,rx_text,sent_id,ner_chunk,entity,rxnorm_code,all_codes,resolutions,res,resolution,drug_resolution
0,19823,2167-02-25,DRUG-FORM,DRUG,1391,1399,Albuterol,FORM,1414,1423,nebulizers,1.0,Albuterol nebulizers,0,Albuterol nebulizers,,2108226,['2108226' '1154602' '370790' '1154603' '2108233' '2108255' '2108276'\n '745678' '1163444' '2108...,['albuterol Inhalation Solution' 'albuterol Inhalant Product'\n 'albuterol Injectable Solution' ...,['albuterol,albuterol,albuterol
1,19823,2167-02-25,DRUG-FORM,DRUG,1405,1412,Atrovent,FORM,1414,1423,nebulizers,1.0,Atrovent nebulizers,0,Atrovent nebulizers,,2108451,['2108451' '1173573' '379767' '1173576' '2463732' '1945043' '1172634'\n '1171309' '363357' '1184...,['ipratropium Inhalation Solution [Atrovent]' 'Atrovent Inhalant Product'\n 'Atrovent Autohaler'...,['ipratropium,ipratropium,ipratropium
2,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,1539,1543,40 mg,DRUG,1551,1555,Lasix,1.0,Lasix 40 mg,0,Lasix 40 mg,,200809,['200809' '617319' '103919' '1871459' '201286' '2556796' '1927858'\n '1648194' '977916' '352320'...,['furosemide 40 MG Oral Tablet [Lasix]' 'atorvastatin 40 MG [Lipitor]'\n 'fluvastatin 40 MG Oral...,['furosemide,furosemide,furosemide
3,19823,2167-02-25,ROUTE-DRUG,ROUTE,1548,1549,IV,DRUG,1551,1555,Lasix,1.0,Lasix,0,Lasix,,202991,['202991' '151963' '2256936' '2256930' '1043720' '224946' '217961'\n '203783' '261550' '1013021'...,['Lasix' 'Lasma' 'lasmiditan Oral Tablet' 'lasmiditan' 'LidoWorx' 'Lidex'\n 'Laniroif' 'Lanoxica...,['Lasix',Lasix',Lasix
4,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,Amaryl,STRENGTH,2343,2348,2.0 mg,1.0,Amaryl 2.0 mg,0,Amaryl 2.0 mg,,901295,['901295' '153591' '1310138' '213799' '2399657' '1036818' '998190'\n '1439900' '905270' '540140'...,['sodium fluoride 2.2 MG [Ludent]' 'glimepiride 2 MG Oral Tablet [Amaryl]'\n 'everolimus 2 MG Ta...,['sodium,sodium,sodium
5,19823,2167-02-25,DRUG-ROUTE,DRUG,2336,2341,Amaryl,ROUTE,2350,2351,po,1.0,Amaryl,0,Amaryl,,215221,['215221' '135820' '151348' '215203' '153592' '152800' '215200' '151345'\n '131725' '215206' '83...,['Amilac' 'Aventyl' 'Amytal' 'Amcort' 'Amaryl' 'Amilamont' 'Ambenyl'\n 'Amoram' 'Ambien' 'Americ...,['Amilac',Amilac',Amilac
6,19823,2167-02-25,DRUG-FREQUENCY,DRUG,2336,2341,Amaryl,FREQUENCY,2353,2355,bid,1.0,Amaryl,0,Amaryl,,215221,['215221' '135820' '151348' '215203' '153592' '152800' '215200' '151345'\n '131725' '215206' '83...,['Amilac' 'Aventyl' 'Amytal' 'Amcort' 'Amaryl' 'Amilamont' 'Ambenyl'\n 'Amoram' 'Ambien' 'Americ...,['Amilac',Amilac',Amilac
7,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,Amaryl,STRENGTH,2372,2379,"1,000 mg",1.0,"Amaryl 1,000 mg",0,"Amaryl 1,000 mg",,106248,['106248' '1549223' '1654725' '1298448' '282828' '885214' '1312717'\n '417424' '409160' '1293504...,['hydrocortisone 1 MG/ML Topical Cream' 'lidocaine 10 MG/ML Topical Spray'\n 'glycerin 250 MG/ML...,['hydrocortisone,hydrocortisone,hydrocortisone
8,19823,2167-02-25,DRUG-FREQUENCY,DRUG,2336,2341,Amaryl,FREQUENCY,2384,2386,bid,1.0,Amaryl,0,Amaryl,,215221,['215221' '135820' '151348' '215203' '153592' '152800' '215200' '151345'\n '131725' '215206' '83...,['Amilac' 'Aventyl' 'Amytal' 'Amcort' 'Amaryl' 'Amilamont' 'Ambenyl'\n 'Amoram' 'Ambien' 'Americ...,['Amilac',Amilac',Amilac
9,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,2343,2348,2.0 mg,DRUG,2361,2370,Glucophage,1.0,Glucophage 2.0 mg,0,Glucophage 2.0 mg,,865570,['865570' '201058' '1855336' '2001263' '205490' '808502' '996825' '199176'\n '999493' '315321' '...,['glipizide 2.5 MG [Glucotrol]' 'glyburide 2.5 MG Oral Tablet [Euglucon]'\n 'omeprazole 2.5 MG [...,['glipizide,glipizide,glipizide
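
After the CSV round trip, the `resolutions` column holds the string form of an array (for example `['Lasix' 'Lasma' ...]`), so the cells above recover the drug name by slicing and stripping characters. The sketch below does the same in one step with a regular expression. It assumes each resolution is wrapped in single quotes, as in the output shown, and the helper names are hypothetical.

```python
import re


def first_resolution(resolutions_repr: str):
    """Return the first single-quoted item of the stringified array, or None."""
    match = re.search(r"'([^']*)'", resolutions_repr)
    return match.group(1) if match else None


def drug_token(resolutions_repr: str):
    """Return the lowercased first word of the top resolution, or None."""
    first = first_resolution(resolutions_repr)
    return first.split()[0].lower() if first else None


print(drug_token("['furosemide 40 MG Oral Tablet [Lasix]' 'atorvastatin 40 MG [Lipitor]']"))
print(drug_token("['Lasix' 'Lasma' 'lasmiditan Oral Tablet']"))
```

Like the original cells, this keeps only the first word of the top-ranked resolution, so a multi-word ingredient such as "sodium fluoride" is reduced to "sodium".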


In [0]:
df['drug_resolution'] = df['drug_resolution'].str.lower()
df['chunk1'] = df['chunk1'].str.lower()
df['chunk2'] = df['chunk2'].str.lower()
df.head(20)

Unnamed: 0,subject_id,date,relation,entity1,entity1_begin,entity1_end,chunk1,entity2,entity2_begin,entity2_end,chunk2,confidence,rx_text,sent_id,ner_chunk,entity,rxnorm_code,all_codes,resolutions,res,resolution,drug_resolution
0,19823,2167-02-25,DRUG-FORM,DRUG,1391,1399,albuterol,FORM,1414,1423,nebulizers,1.0,Albuterol nebulizers,0,Albuterol nebulizers,,2108226,['2108226' '1154602' '370790' '1154603' '2108233' '2108255' '2108276'\n '745678' '1163444' '2108...,['albuterol Inhalation Solution' 'albuterol Inhalant Product'\n 'albuterol Injectable Solution' ...,['albuterol,albuterol,albuterol
1,19823,2167-02-25,DRUG-FORM,DRUG,1405,1412,atrovent,FORM,1414,1423,nebulizers,1.0,Atrovent nebulizers,0,Atrovent nebulizers,,2108451,['2108451' '1173573' '379767' '1173576' '2463732' '1945043' '1172634'\n '1171309' '363357' '1184...,['ipratropium Inhalation Solution [Atrovent]' 'Atrovent Inhalant Product'\n 'Atrovent Autohaler'...,['ipratropium,ipratropium,ipratropium
2,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,1539,1543,40 mg,DRUG,1551,1555,lasix,1.0,Lasix 40 mg,0,Lasix 40 mg,,200809,['200809' '617319' '103919' '1871459' '201286' '2556796' '1927858'\n '1648194' '977916' '352320'...,['furosemide 40 MG Oral Tablet [Lasix]' 'atorvastatin 40 MG [Lipitor]'\n 'fluvastatin 40 MG Oral...,['furosemide,furosemide,furosemide
3,19823,2167-02-25,ROUTE-DRUG,ROUTE,1548,1549,iv,DRUG,1551,1555,lasix,1.0,Lasix,0,Lasix,,202991,['202991' '151963' '2256936' '2256930' '1043720' '224946' '217961'\n '203783' '261550' '1013021'...,['Lasix' 'Lasma' 'lasmiditan Oral Tablet' 'lasmiditan' 'LidoWorx' 'Lidex'\n 'Laniroif' 'Lanoxica...,['Lasix',Lasix',lasix
4,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,amaryl,STRENGTH,2343,2348,2.0 mg,1.0,Amaryl 2.0 mg,0,Amaryl 2.0 mg,,901295,['901295' '153591' '1310138' '213799' '2399657' '1036818' '998190'\n '1439900' '905270' '540140'...,['sodium fluoride 2.2 MG [Ludent]' 'glimepiride 2 MG Oral Tablet [Amaryl]'\n 'everolimus 2 MG Ta...,['sodium,sodium,sodium
5,19823,2167-02-25,DRUG-ROUTE,DRUG,2336,2341,amaryl,ROUTE,2350,2351,po,1.0,Amaryl,0,Amaryl,,215221,['215221' '135820' '151348' '215203' '153592' '152800' '215200' '151345'\n '131725' '215206' '83...,['Amilac' 'Aventyl' 'Amytal' 'Amcort' 'Amaryl' 'Amilamont' 'Ambenyl'\n 'Amoram' 'Ambien' 'Americ...,['Amilac',Amilac',amilac
6,19823,2167-02-25,DRUG-FREQUENCY,DRUG,2336,2341,amaryl,FREQUENCY,2353,2355,bid,1.0,Amaryl,0,Amaryl,,215221,['215221' '135820' '151348' '215203' '153592' '152800' '215200' '151345'\n '131725' '215206' '83...,['Amilac' 'Aventyl' 'Amytal' 'Amcort' 'Amaryl' 'Amilamont' 'Ambenyl'\n 'Amoram' 'Ambien' 'Americ...,['Amilac',Amilac',amilac
7,19823,2167-02-25,DRUG-STRENGTH,DRUG,2336,2341,amaryl,STRENGTH,2372,2379,"1,000 mg",1.0,"Amaryl 1,000 mg",0,"Amaryl 1,000 mg",,106248,['106248' '1549223' '1654725' '1298448' '282828' '885214' '1312717'\n '417424' '409160' '1293504...,['hydrocortisone 1 MG/ML Topical Cream' 'lidocaine 10 MG/ML Topical Spray'\n 'glycerin 250 MG/ML...,['hydrocortisone,hydrocortisone,hydrocortisone
8,19823,2167-02-25,DRUG-FREQUENCY,DRUG,2336,2341,amaryl,FREQUENCY,2384,2386,bid,1.0,Amaryl,0,Amaryl,,215221,['215221' '135820' '151348' '215203' '153592' '152800' '215200' '151345'\n '131725' '215206' '83...,['Amilac' 'Aventyl' 'Amytal' 'Amcort' 'Amaryl' 'Amilamont' 'Ambenyl'\n 'Amoram' 'Ambien' 'Americ...,['Amilac',Amilac',amilac
9,19823,2167-02-25,STRENGTH-DRUG,STRENGTH,2343,2348,2.0 mg,DRUG,2361,2370,glucophage,1.0,Glucophage 2.0 mg,0,Glucophage 2.0 mg,,865570,['865570' '201058' '1855336' '2001263' '205490' '808502' '996825' '199176'\n '999493' '315321' '...,['glipizide 2.5 MG [Glucotrol]' 'glyburide 2.5 MG Oral Tablet [Euglucon]'\n 'omeprazole 2.5 MG [...,['glipizide,glipizide,glipizide


In [0]:
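outname = 'posology_RE_rxnorm_w_drug_resolutions.csv'
outdir = f'{delta_path}golden/dataset'
dbutils.fs.mkdirs(outdir)
# pandas writes through the /dbfs mount; os.path.join keeps the file inside outdir
df.to_csv(os.path.join(f'/dbfs{outdir}', outname), index=False, encoding="utf-8")
display(dbutils.fs.ls(outdir))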

## NER JSL Slim

The model card for `ner_jsl_slim` is [here](https://nlp.johnsnowlabs.com/2021/08/13/ner_jsl_slim_en.html).

In [0]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")\
      .setCustomBounds([r"\|"])

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

jsl_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

jsl_converter = NerConverter() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['Symptom','Body_Part', 'Procedure', 'Disease_Syndrome_Disorder', 'Test'])

ner_pipeline = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        jsl_ner,
        jsl_converter
        ])

data_ner = spark.createDataFrame([[""]]).toDF("text")
model = ner_pipeline.fit(data_ner)

In [0]:
results = model.transform(sparkDF)
results.printSchema()

In [0]:
result_df = results.select('subject_id','date',
                           F.explode(F.arrays_zip(results.ner_chunk.result, results.ner_chunk.begin, results.ner_chunk.end, results.ner_chunk.metadata)).alias("cols")) \
                    .select('subject_id','date',
                            F.expr("cols['3']['sentence']").alias("sentence_id"),
                            F.expr("cols['0']").alias("chunk"),
                            F.expr("cols['1']").alias("begin"),
                            F.expr("cols['2']").alias("end"),
                            F.expr("cols['3']['entity']").alias("ner_label"))\
                    .filter("ner_label!='O'")
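
The `arrays_zip` plus `explode` pattern above turns the parallel annotation arrays of one note (`result`, `begin`, `end`, `metadata`) into one row per chunk. A plain-Python sketch of the same reshaping follows. The sample values are taken from the output below, and the function name is an assumption for this example.

```python
def explode_chunks(results, begins, ends, metadata):
    """Zip parallel annotation arrays into one dict per NER chunk."""
    rows = []
    for chunk, begin, end, meta in zip(results, begins, ends, metadata):
        rows.append({
            "sentence_id": meta["sentence"],
            "chunk": chunk,
            "begin": begin,
            "end": end,
            "ner_label": meta["entity"],
        })
    # Mirror the final filter: drop anything labelled as outside an entity.
    return [row for row in rows if row["ner_label"] != "O"]


rows = explode_chunks(
    results=["Shortness of breath", "cough"],
    begins=[178, 199],
    ends=[196, 203],
    metadata=[
        {"entity": "Symptom", "sentence": "0"},
        {"entity": "Symptom", "sentence": "0"},
    ],
)
for row in rows:
    print(row)
```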

In [0]:
result_df.show()

In [0]:
pd_result = result_df.toPandas()
pd_result

Unnamed: 0,subject_id,date,sentence_id,chunk,begin,end,ner_label
0,19823,2167-02-25,0,Shortness of breath,178,196,Symptom
1,19823,2167-02-25,0,cough,199,203,Symptom
2,19823,2167-02-25,1,diabetes type II,345,360,Disease_Syndrome_Disorder
3,19823,2167-02-25,1,congestive heart failure,363,386,Disease_Syndrome_Disorder
4,19823,2167-02-25,1,hypertension,413,424,Disease_Syndrome_Disorder
...,...,...,...,...,...,...,...
18456,70004,2182-08-16,13,Multilevel degenerative changes,1860,1890,Symptom
18457,70004,2182-08-16,13,uncovertebral joint hypertrophy,1897,1927,Disease_Syndrome_Disorder
18458,70004,2182-08-16,14,metastatic disease,1966,1983,Oncological
18459,70004,2182-08-16,14,cervical spine,1992,2005,Body_Part


In [0]:
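outname = 'ner_jsl_slim_results.csv'
outdir = f'{delta_path}golden/dataset'
dbutils.fs.mkdirs(outdir)
# pandas writes through the /dbfs mount; os.path.join keeps the file inside outdir
pd_result.to_csv(os.path.join(f'/dbfs{outdir}', outname), index=False, encoding="utf-8")
display(dbutils.fs.ls(outdir))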

## License
Copyright / License info of the notebook. Copyright [2021] the Notebook Authors.  The source in this notebook is provided subject to the [Apache 2.0 License](https://spdx.org/licenses/Apache-2.0.html).  All included or referenced third party libraries are subject to the licenses set forth below.

|Library Name|Library License|Library License URL|Library Source URL|
| :-: | :-:| :-: | :-:|
|Pandas |BSD 3-Clause License| https://github.com/pandas-dev/pandas/blob/master/LICENSE | https://github.com/pandas-dev/pandas|
|Numpy |BSD 3-Clause License| https://github.com/numpy/numpy/blob/main/LICENSE.txt | https://github.com/numpy/numpy|
|Apache Spark |Apache License 2.0| https://github.com/apache/spark/blob/master/LICENSE | https://github.com/apache/spark/tree/master/python/pyspark|
|BeautifulSoup|MIT License|https://www.crummy.com/software/BeautifulSoup/#Download|https://www.crummy.com/software/BeautifulSoup/bs4/download/|
|Requests|Apache License 2.0|https://github.com/psf/requests/blob/main/LICENSE|https://github.com/psf/requests|
|Spark NLP Display|Apache License 2.0|https://github.com/JohnSnowLabs/spark-nlp-display/blob/main/LICENSE|https://github.com/JohnSnowLabs/spark-nlp-display|
|Spark NLP |Apache License 2.0| https://github.com/JohnSnowLabs/spark-nlp/blob/master/LICENSE | https://github.com/JohnSnowLabs/spark-nlp|
|Spark NLP for Healthcare|[Proprietary license - John Snow Labs Inc.](https://www.johnsnowlabs.com/spark-nlp-health/) |NA|NA|




|Author|
|-|
|Databricks Inc.|
|John Snow Labs Inc.|

## Disclaimers
Databricks Inc. (“Databricks”) does not dispense medical, diagnosis, or treatment advice. This Solution Accelerator (“tool”) is for informational purposes only and may not be used as a substitute for professional medical advice, treatment, or diagnosis. This tool may not be used within Databricks to process Protected Health Information (“PHI”) as defined in the Health Insurance Portability and Accountability Act of 1996, unless you have executed with Databricks a contract that allows for processing PHI, an accompanying Business Associate Agreement (BAA), and are running this notebook within a HIPAA Account.  Please note that if you run this notebook within Azure Databricks, your contract with Microsoft applies.