![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# 3. Clinical Entity Resolvers v2.7.0

In [0]:
import os
import json
import string
import numpy as np
import pandas as pd


import sparknlp
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.pretrained import ResourceDownloader

from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

pd.set_option('max_colwidth', 100)
pd.set_option('display.max_columns', 100)  
pd.set_option('display.expand_frame_repr', False)


print('sparknlp_jsl.version : ',sparknlp_jsl.version())

spark

# Clinical Resolvers

## Entity Resolvers for ICD-10

A common NLP problem in biomedical applications is identifying the presence of clinical entities in a given text. These clinical entities can be diseases, symptoms, drugs, results of clinical investigations, or others.

Besides the code in the "result" field, the resolver returns more metadata about the matching process:

- target_text -> Text to resolve
- resolved_text -> Best match text
- confidence -> Relative confidence for the top match (distance to probability)
- confidence_ratio -> Relative confidence for the top match. TopMatchConfidence / SecondMatchConfidence
- alternative_codes -> List of other plausible codes (in the KNN neighborhood)
- alternative_confidence_ratios -> Rest of confidence ratios
- all_k_results -> All resolved codes for metrics calculation purposes
- sentence -> SentenceId
- chunk -> ChunkId
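
A minimal sketch of how `confidence` and `confidence_ratio` relate. Only the ratio definition (top confidence divided by second confidence) comes from the list above. The distance-to-probability conversion (a softmax over negative distances), the function names and the sample distances are assumptions made for illustration, not the library's implementation.

```python
import math

def distances_to_confidences(distances):
    """Map distances to probabilities (assumption: softmax over -distance)."""
    weights = [math.exp(-d) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

def confidence_ratio(confidences):
    """TopMatchConfidence / SecondMatchConfidence, as described above."""
    ranked = sorted(confidences, reverse=True)
    return ranked[0] / ranked[1]

# Hypothetical distances for three candidate codes (smaller means closer).
candidate_distances = [0.2, 0.9, 1.5]
confidences = distances_to_confidences(candidate_distances)
ratio = confidence_ratio(confidences)
print(confidences, ratio)
```

A ratio close to 1 means the two best candidates are nearly tied, so the top code deserves less trust than one with a large ratio.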

### Clinical NER Pipeline creation

In [0]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Sentence Detector DL annotator, processes various sentences per line
sentenceDetectorDL = SentenceDetectorDLModel\
  .pretrained("sentence_detector_dl_healthcare", "en", 'clinical/models') \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")

# Tokenizer splits each sentence into tokens

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("raw_token")

# StopWordsCleaner removes stop words from the raw tokens

stopwords = StopWordsCleaner()\
  .setInputCols(["raw_token"])\
  .setOutputCol("token")
  

The next annotator in the pipeline is "WordEmbeddingsModel". We will download a pretrained model named "embeddings_clinical", available from "clinical/models".

Be patient when running this cell.

The first time you call this pretrained model, it has to be downloaded to your local machine, which takes a while.

The model is about 1.7 GB and is typically saved in your home folder as

`~HOMEFOLDER/cached_models/ embeddings_clinical_en_2.0.2_2.4_1558454742956`

On later calls the model is loaded from your cached copy, but even then it has to be indexed each time, so expect to wait up to 5 minutes (depending on your machine).

In [0]:
# WordEmbeddingsModel pretrained "embeddings_clinical" includes a model of 1.7Gb that needs to be downloaded

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")
  

The last model in our NER pipeline is the pretrained `ner_clinical` NerDLModel, available from "clinical/models". It takes "sentence", "token" and "embeddings" (from the clinical embeddings pretrained model) as input and classifies each token into one of four categories:

- `PROBLEM`: for patient problems

- `TEST`: for tests, labs, etc.

- `TREATMENT`: for treatments, medicines, etc.

- `OTHER`: for the rest of tokens.

To separate consecutive entities, the B prefix (as in B-PROBLEM) marks the first token of each entity, and the I prefix (as in I-PROBLEM) marks the remaining tokens inside that entity.
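
The B/I convention can be illustrated with a small, Spark-free sketch that groups tagged tokens into entities. The function name `bio_to_chunks` and the sample tokens and tags are hypothetical. The real grouping is done by the converter annotator later in the pipeline, so this only shows the idea.

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (entity_label, chunk_text) pairs."""
    chunks = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one.
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label continues the open chunk.
            current_tokens.append(token)
        else:
            # Any other tag closes the open chunk.
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks

# Hypothetical tokens and tags, not actual model output.
tokens = ["polyuria", "polydipsia", "poor", "appetite", "and", "vomiting"]
tags = ["B-PROBLEM", "B-PROBLEM", "B-PROBLEM", "I-PROBLEM", "O", "B-PROBLEM"]
chunks = bio_to_chunks(tokens, tags)
print(chunks)
```

Without the B prefix, the three adjacent problems at the start of the list would be indistinguishable from one long entity.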

In [0]:
# Named Entity Recognition for clinical concepts.

clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")


### Define the NER pipeline

Now we will define the actual pipeline that puts together the annotators we have created.

In [0]:
# Build up the pipeline

pipeline_ner = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetectorDL,
    tokenizer,
    stopwords,
    word_embeddings,
    clinical_ner,
    ner_converter
  ])

### Create a SparkDataFrame with the content

Now we will create a sample Spark dataframe with our clinical note example.

In this example we are working with a single clinical note. In production environments, a table with many such clinical notes could be distributed across a cluster and processed at large scale.

In [0]:

clinical_note = (
    'A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years '
    'prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior '
    'episode of HTG-induced pancreatitis three years prior to presentation, associated '
    'with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, '
    'presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. '
    'Two weeks prior to presentation, she was treated with a five-day course of amoxicillin '
    'for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin '
    'for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months '
    'at the time of presentation. Physical examination on presentation was significant for dry oral mucosa; '
    'significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent '
    'laboratory findings on admission were: serum glucose 111 mg/dl, bicarbonate 18 mmol/l, anion gap 20, '
    'creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, glycated hemoglobin (HbA1c) '
    '10%, and venous pH 7.27. Serum lipase was normal at 43 U/L. Serum acetone levels could not be assessed '
    'as blood samples kept hemolyzing due to significant lipemia. The patient was initially admitted for '
    'starvation ketosis, as she reported poor oral intake for three days prior to admission. However, '
    'serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL, the anion gap '
    'was still elevated at 21, serum bicarbonate was 16 mmol/L, triglyceride level peaked at 2050 mg/dL, and '
    'lipase was 52 U/L. The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - '
    'the original sample was centrifuged and the chylomicron layer removed prior to analysis due to '
    'interference from turbidity caused by lipemia again. The patient was treated with an insulin drip '
    'for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL, within '
    '24 hours. Her euDKA was thought to be precipitated by her respiratory tract infection in the setting '
    'of SGLT2 inhibitor use. The patient was seen by the endocrinology service and she was discharged on '
    '40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg '
    'two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely. She '
    'had close follow-up with endocrinology post discharge.'
)

data_ner = spark.createDataFrame([[clinical_note]]).toDF("text")

In [0]:
data_ner.show(truncate = 100)


### Transform / annotate the clinical note using the model.

To process the data with the newly fitted model we have two options.

The first is to use the model to transform our clinical note with the command:

`output = model.transform(data_ner)`

That stores the results of running the model over the clinical note in a Spark DataFrame (`output`).

However, for small tests like this one, or for real-time requests, a `LightPipeline` is a simpler way of handling the data. It returns a dictionary (instead of a Spark DataFrame) with the results of the transformation.

We will fit the pipeline to obtain `model`, wrap it in a `LightPipeline` (`light_pipeline`), and then annotate `clinical_note` with it.

In [0]:
model = pipeline_ner.fit(data_ner)

light_pipeline = LightPipeline(model)
light_data = light_pipeline.annotate(clinical_note)

Now we have a dictionary (`light_data`) that contains the results of running the NER pipeline over our clinical note.

It contains the original document:

In [0]:
light_data['document'][0][0:100]


In [0]:
print("Number of sentences: {}".format(len(light_data['sentence'])))
print("")
for i in range(5):
    print("Sentence {}: {}".format(i, light_data['sentence'][i]))

In [0]:
print("Number of tokens: {}".format(len(light_data['token'])))
print("")
for i in range(25):
    print("Token {}: {} ({})".format(i, light_data['token'][i], light_data['ner'][i]))
print("...")
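
The loop above relies on the `token` and `ner` lists being aligned one-to-one. As a sketch of working with that structure, the snippet below keeps only the tokens that belong to an entity. `sample_light_data` is a hypothetical stand-in for the dictionary returned by `annotate`, so the example runs without Spark, and `tagged_tokens` is an illustrative helper, not part of the library.

```python
# Hypothetical stand-in for the dictionary returned by light_pipeline.annotate();
# the real one holds one list of strings per pipeline output column.
sample_light_data = {
    "token": ["history", "gestational", "diabetes", "mellitus", "diagnosed"],
    "ner": ["O", "B-PROBLEM", "I-PROBLEM", "I-PROBLEM", "O"],
}

def tagged_tokens(light_data):
    """Return (token, tag) pairs for tokens that belong to an entity."""
    return [
        (token, tag)
        for token, tag in zip(light_data["token"], light_data["ner"])
        if tag != "O"
    ]

entity_tokens = tagged_tokens(sample_light_data)
print(entity_tokens)
```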

Let's apply some HTML formatting to see the results of the pipeline in a nicer layout:

In [0]:
%sh
rm -rf ner_highlighter.py.1

In [0]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/utils/ner_highlighter.py


In [0]:
import sys

# Add the download location to the module search path before importing;
# use a local path or a mounted S3 bucket, e.g. /dbfs/mnt/<path_to_bucket>
sys.path.append('/databricks/driver/')

import ner_highlighter


In [0]:
light_result_basic = light_pipeline.annotate(clinical_note)

displayHTML(ner_highlighter.token_highlighter(light_result_basic))

## ICD10 background info

ICD-10-CM vs. ICD-10-PCS

With the transition to ICD-10, in the United States, ICD-9 codes are segmented into ICD-10-CM and ICD-10-PCS codes. **The "CM" in ICD-10-CM codes stands for clinical modification**; ICD-10-CM codes were developed by the Centers for Disease Control and Prevention in conjunction with the National Center for Health Statistics (NCHS), for outpatient medical coding and reporting in the United States, as published by the World Health Organization (WHO).

**The "PCS" in ICD-10-PCS codes stands for the procedural classification system**. ICD-10-PCS is a completely separate medical coding system from ICD-10-CM, containing an additional 87,000 codes for use ONLY in United States inpatient, hospital settings. The procedure classification system (ICD-10-PCS) was developed by the Centers for Medicare and Medicaid Services (CMS) in conjunction with 3M Health Information Management (HIM).

ICD-10-CM codes add increased specificity to their ICD-9 predecessors, growing to five times the number of codes as the present system; a total of 68,000 clinical modification diagnosis codes. ICD-10-CM codes provide the ability to track and reveal more information about the quality of healthcare, allowing healthcare providers to better understand medical complications, better design treatment and care, and better comprehend and determine the outcome of care.

ICD-10-PCS is used only for inpatient, hospital settings in the United States, and is meant to replace volume 3 of ICD-9 for facility reporting of inpatient procedures. Due to the rapid and constant state of flux in medical procedures and technology, ICD-10-PCS was developed to accommodate the changing landscape. Common procedures, lab tests, and educational sessions that are not unique to the inpatient, hospital setting have been omitted from ICD-10-PCS.

ICD-10 is confusing enough when you’re trying to digest the differences between ICD-9 and ICD-10, but there are also different types of ICD-10 codes that providers should be aware of.


Primary difference between ICD-10-CM and ICD-10-PCS

When most people talk about ICD-10, they are referring to ICD-10-CM. This is the code set for diagnosis coding and is used in all healthcare settings in the United States. ICD-10-PCS, on the other hand, is used in hospital inpatient settings for inpatient procedure coding.

ICD-10-CM breakdown

- Approximately 68,000 codes
- 3–7 alphanumeric characters
- Facilitates timely processing of claims


ICD-10-PCS breakdown

- Will replace ICD-9-CM for hospital inpatient use only. 
- ICD-10-PCS will not replace CPT codes used by physicians. According to HealthCare Information Management, Inc. (HCIM), “Its only intention is to identify inpatient facility services in a way not directly related to physician work, but directed towards allocation of hospital services.”

- 7 alphanumeric characters

ICD-10-PCS is very different from ICD-9-CM procedure coding due to its ability to be more specific and accurate. “This becomes increasingly important when assessing and tracking the quality of medical processes and outcomes, and compiling statistics that are valuable tools for research,” according to HCIM.
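
The character counts above can be turned into a quick shape check for codes returned by a resolver. This is a sketch under stated assumptions: the helper names are hypothetical, and the patterns only test the shape of a code (3 to 7 characters starting with a letter and a digit for ICD-10-CM, exactly 7 alphanumeric characters for ICD-10-PCS), not whether the code exists in the official code set. The resolvers in this notebook emit ICD-10-CM codes without the decimal point, which is conventionally written after the third character.

```python
import re

# ICD-10-CM shape: a letter, a digit, then 1-5 more alphanumerics (3-7 total).
ICD10CM_SHAPE = re.compile(r"^[A-Z][0-9][A-Z0-9]{1,5}$")
# ICD-10-PCS shape: exactly 7 alphanumeric characters.
ICD10PCS_SHAPE = re.compile(r"^[A-Z0-9]{7}$")

def looks_like_icd10cm(code):
    """Shape check only; does not confirm the code exists."""
    return bool(ICD10CM_SHAPE.match(code.replace(".", "").upper()))

def looks_like_icd10pcs(code):
    """Shape check only; does not confirm the code exists."""
    return bool(ICD10PCS_SHAPE.match(code.upper()))

def with_decimal_point(code):
    """Insert the conventional dot after the third character."""
    code = code.replace(".", "").upper()
    return code if len(code) <= 3 else code[:3] + "." + code[3:]

print(with_decimal_point("E1142"), looks_like_icd10cm("E1142"))
```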

## ICD10 coding Pipeline creation.

We will now create a new pipeline that tries to assign an ICD-10 code to each of these problems, based on the content, the word embeddings, and pretrained models for ICD-10 annotation.

The architecture of this new pipeline will be as follows:

- DocumentAssembler (text -> document)

- SentenceDetector (document -> sentence)

- Tokenizer (sentence -> token)

- WordEmbeddingsModel ([sentence, token] -> embeddings)

- NerDLModel ([sentence, token, embeddings] -> ner)

- NerConverter ([sentence, token, ner] -> ner_chunk)

- ChunkEmbeddings ([ner_chunk, embeddings] -> chunk_embeddings)

- ChunkEntityResolverModel ([token, chunk_embeddings] -> icd10cm_code)

So from a text we end up with a list of named entities (ner_chunk) and their ICD-10 codes (icd10cm_code).

Most of the annotators in this pipeline were already created for the previous pipeline, but we need to create three additional ones: NerConverterInternal, ChunkEmbeddings, and a ChunkEntityResolverModel for ICD-10-CM.
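
To make the roles of the chunk embeddings and the resolver more concrete, here is a toy sketch of the underlying idea: pool the token vectors of a chunk into one vector, then rank candidate codes by cosine distance to it. Average pooling, the function names, the 2-dimensional vectors and the code labels are all assumptions for illustration. The pretrained resolver combines several signals and works on much larger vectors.

```python
import math

def average_pool(token_vectors):
    """Average token vectors into one chunk vector (assumed pooling)."""
    size = len(token_vectors)
    return [sum(column) / size for column in zip(*token_vectors)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_codes(chunk_vector, code_vectors, k):
    """Return the k codes whose vectors are closest to the chunk vector."""
    ranked = sorted(
        code_vectors,
        key=lambda code: cosine_distance(chunk_vector, code_vectors[code]),
    )
    return ranked[:k]

# Toy 2-dimensional vectors with hypothetical code labels.
token_vectors = [[1.0, 0.0], [0.8, 0.2]]
code_vectors = {"CODE_A": [1.0, 0.1], "CODE_B": [0.0, 1.0], "CODE_C": [0.5, 0.5]}
chunk_vector = average_pool(token_vectors)
top_two = nearest_codes(chunk_vector, code_vectors, k=2)
print(chunk_vector, top_two)
```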

Now we define the new pipeline

In [0]:
# NerConverterInternal turns the NER tags into chunks; only PROBLEM entities are kept for the resolver

ner_converter = NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")\
  .setWhiteList(['PROBLEM'])\
  .setPreservePosition(False)

chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("ner_chunk", "embeddings")\
    .setOutputCol("chunk_embeddings")

# ICD resolution model

icd10cm_resolution = ChunkEntityResolverModel.pretrained("chunkresolve_icd10cm_clinical", "en", "clinical/models") \
  .setInputCols(["token", "chunk_embeddings"]) \
  .setOutputCol("icd10cm_code") \
  .setDistanceFunction("COSINE") \
  .setNeighbours(5)

# .setDistanceFunction("EUCLIDEAN")

`setPreservePosition(True)` keeps exactly the original indices (under some tokenization conditions the chunk might include undesired characters such as `")","]"...)`

`setPreservePosition(False)` uses adjusted indices, found by locating the first token (for the begin index) and the last token (for the end index) as substrings.

With `NerConverterInternal` we can also use `greedyMode`, which merges consecutive entities of the same type regardless of B- boundaries.
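
The index adjustment can be pictured with a small sketch that locates a chunk by searching for its first and last tokens in the text. This is an illustration of the idea only, not the annotator's code. `adjusted_span` is a hypothetical helper and the sentence is the opening of the clinical note used in this notebook.

```python
def adjusted_span(text, chunk_tokens):
    """Locate a chunk by searching for its first and last tokens.

    Returns (begin, end) with an inclusive end index.
    """
    begin = text.find(chunk_tokens[0])
    last_start = text.find(chunk_tokens[-1], begin)
    if begin < 0 or last_start < 0:
        raise ValueError("chunk tokens not found in text")
    end = last_start + len(chunk_tokens[-1]) - 1
    return begin, end

sentence = ("A 28-year-old female with a history of gestational diabetes "
            "mellitus diagnosed eight years prior to presentation")
span = adjusted_span(sentence, ["gestational", "diabetes", "mellitus"])
print(span, sentence[span[0]:span[1] + 1])
```

Because the span is anchored on token text rather than raw tokenizer offsets, stray punctuation attached to a token boundary is left out of the chunk.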

In [0]:
sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

pipeline_icd10 = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    stopwords,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunk_embeddings,
    icd10cm_resolution
  ])

model_icd10 = pipeline_icd10.fit(data_ner)


In [0]:
light_pipeline_icd10 = LightPipeline(model_icd10)


In [0]:
text = light_data['document'][0]

text

In [0]:
import pandas as pd

light_result = light_pipeline_icd10.annotate(text)

df = pd.DataFrame(list(zip(light_result['ner_chunk'], light_result['icd10cm_code'])),
                  columns = ['Problem','ICD10-CM-Code'])

In [0]:
df.head()

| | Problem | ICD10-CM-Code |
|---|---|---|
| 0 | gestational diabetes mellitus | P702 |
| 1 | type two diabetes mellitus | E1142 |
| 2 | T2DM | E1121 |
| 3 | prior episode of HTG-induced pancreatitis | K860 |
| 4 | associated with an acute hepatitis | B172 |


In [0]:
def get_icd10_codes(light_model, text, er_code):
  """Return one row per NER chunk with its span, resolved code and candidate resolutions.

  er_code is the name of the resolver output column, e.g. 'icd10cm_code'.
  """

  full_light_result = light_model.fullAnnotate(text)

  chunks = []
  codes = []
  begin = []
  end = []
  resolutions = []

  # Chunk and resolver annotations are paired one-to-one
  for chunk, code in zip(full_light_result[0]['ner_chunk'], full_light_result[0][er_code]):
      begin.append(chunk.begin)
      end.append(chunk.end)
      chunks.append(chunk.result)
      codes.append(code.result)
      resolutions.append(code.metadata['all_k_resolutions'])

  df = pd.DataFrame({'chunks': chunks, 'begin': begin, 'end': end,
                     er_code: codes,
                     'resolutions': resolutions})

  return df




In [0]:
df = get_icd10_codes (light_pipeline_icd10, text, 'icd10cm_code')

df

| | chunks | begin | end | icd10cm_code | resolutions |
|---|---|---|---|---|---|
| 0 | gestational diabetes mellitus | 39 | 67 | P702 | Neonatal diabetes mellitus:::Type 2 diabetes m... |
| 1 | type two diabetes mellitus | 128 | 153 | E1142 | Type 2 diabetes mellitus with diabetic polyneu... |
| 2 | T2DM | 156 | 159 | E1121 | Type 2 diabetes mellitus with diabetic nephrop... |
| 3 | prior episode of HTG-induced pancreatitis | 167 | 207 | K860 | Alcohol-induced chronic pancreatitis:::Bipolar... |
| 4 | associated with an acute hepatitis | 244 | 277 | B172 | Acute hepatitis E:::Acute viral hepatitis, uns... |
| 5 | obesity with a body mass index | 284 | 313 | Z6828 | Body mass index (BMI) 28.0-28.9, adult:::Body ... |
| 6 | BMI) of 33.5 kg/m2 | 316 | 333 | Z6825 | Body mass index (BMI) 25.0-25.9, adult:::Body ... |
| 7 | polyuria | 373 | 380 | R358 | Other polyuria:::Polydipsia:::Generalized edem... |
| 8 | polydipsia | 383 | 392 | R631 | Polydipsia:::Anhedonia:::Galactorrhea |
| 9 | poor appetite | 395 | 407 | R630 | Anorexia:::Nutritional deficiency, unspecified... |
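
The `resolutions` column packs all candidate descriptions into one string separated by `:::`, as the untruncated `polydipsia` row shows. A small helper can split it back into a list. `split_resolutions` is a hypothetical name, and the separator is inferred from the output above.

```python
def split_resolutions(all_k_resolutions, separator=":::"):
    """Split the concatenated candidate descriptions into a list."""
    return [part.strip() for part in all_k_resolutions.split(separator) if part.strip()]

# Value copied from the polydipsia row of the output above.
candidates = split_resolutions("Polydipsia:::Anhedonia:::Galactorrhea")
print(candidates)
```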


In [0]:
import pyspark.sql.functions as F

output = model_icd10.transform(data_ner).cache()

output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata",
                                     "icd10cm_code.result","icd10cm_code.metadata")).alias("icd10cm_result")) \
.select(F.expr("icd10cm_result['0']").alias("chunk"),
        F.expr("icd10cm_result['1'].entity").alias("entity"),
        F.expr("icd10cm_result['3'].resolved_text").alias("resolved_text"),
        F.expr("icd10cm_result['2']").alias("code"),
        F.expr("icd10cm_result['3'].all_k_resolutions").alias("cms"))\
.distinct() \
.toPandas()


| | chunk | entity | resolved_text | code | cms |
|---|---|---|---|---|---|
| 0 | type two diabetes mellitus | PROBLEM | Type 2 diabetes mellitus with diabetic polyneu... | E1142 | Type 2 diabetes mellitus with diabetic polyneu... |
| 1 | amoxicillin for a respiratory tract infection | PROBLEM | Respiratory disorder, unspecified | J989 | Respiratory disorder, unspecified:::Acute naso... |
| 2 | lipemia | PROBLEM | Glycosuria | R81 | Glycosuria:::Pure hyperglyceridemia:::Hyperchy... |
| 3 | gestational diabetes mellitus | PROBLEM | Neonatal diabetes mellitus | P702 | Neonatal diabetes mellitus:::Type 2 diabetes m... |
| 4 | benign with no tenderness | PROBLEM | Periumbilic abdominal tenderness | R10815 | Periumbilic abdominal tenderness:::Epigastric ... |
| 5 | vomiting | PROBLEM | Bilious vomiting | R1114 | Bilious vomiting:::Vomiting without nausea:::N... |
| 6 | polydipsia | PROBLEM | Polydipsia | R631 | Polydipsia:::Anhedonia:::Galactorrhea |
| 7 | significant for dry oral mucosa | PROBLEM | Irritative hyperplasia of oral mucosa | K136 | Irritative hyperplasia of oral mucosa:::Leukop... |
| 8 | poor appetite | PROBLEM | Anorexia | R630 | Anorexia:::Nutritional deficiency, unspecified... |
| 9 | euDKA | PROBLEM | Shortness of breath | R0602 | Shortness of breath:::Chancroid:::Phimosis:::R... |
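
The `arrays_zip` plus `explode` combination used in the query above turns parallel arrays stored in one row into one row per element. The plain-Python sketch below mimics that reshaping for a single row. `zip_and_explode` is a hypothetical helper, and unlike Spark's `arrays_zip`, which pads shorter arrays with nulls, Python's `zip` stops at the shortest list.

```python
def zip_and_explode(row):
    """Emulate arrays_zip + explode for one row of parallel arrays."""
    columns = list(row)
    return [
        dict(zip(columns, values))
        for values in zip(*(row[column] for column in columns))
    ]

# Values copied from the first two rows of the ICD-10-CM output shown earlier.
row = {
    "chunk": ["gestational diabetes mellitus", "type two diabetes mellitus"],
    "code": ["P702", "E1142"],
}
exploded = zip_and_explode(row)
print(exploded)
```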


In [0]:
text = 'He has a starvation ketosis but nothing found for significant for dry oral mucosa'


In [0]:

df = get_icd10_codes(light_pipeline_icd10, text, 'icd10cm_code')

df

| | chunks | begin | end | icd10cm_code | resolutions |
|---|---|---|---|---|---|
| 0 | starvation ketosis | 9 | 26 | E71121 | Propionic acidemia:::Bartter's syndrome:::Hypo... |
| 1 | significant for dry oral mucosa | 50 | 80 | K136 | Irritative hyperplasia of oral mucosa:::Leukop... |


# ICD10 with SentenceEntityResolver (BioBert) (after Spark NLP 2.7)

We have 7 new `english` Sentence Entity Resolution models for Clinical Terminologies:
   - `biobertresolve_cpt` trained with `BertSentenceEmbeddings.pretrained('sent_biobert_pubmed_base_cased')`
   - `biobertresolve_icdo` trained with `BertSentenceEmbeddings.pretrained('sent_biobert_pubmed_base_cased')`
   - `biobertresolve_icd10cm` trained with `BertSentenceEmbeddings.pretrained('sent_biobert_pubmed_base_cased')`
   - `biobertresolve_icd10pcs` trained with `BertSentenceEmbeddings.pretrained('sent_biobert_pubmed_base_cased')`
   - `biobertresolve_loinc` trained with `BertSentenceEmbeddings.pretrained('sent_biobert_pubmed_base_cased')`
   - `biobertresolve_snomed` trained with `BertSentenceEmbeddings.pretrained('sent_biobert_pubmed_base_cased')`
   - `biobertresolve_rxnorm` trained with `BertSentenceEmbeddings.pretrained('sent_biobert_pubmed_base_cased')`

In [0]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

# WordEmbeddingsModel pretrained "embeddings_clinical" includes a model of 1.7Gb that needs to be downloaded

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

# Named Entity Recognition for clinical concepts.
clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")


  

In [0]:
ner_converter = NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")\
  .setWhiteList(['PROBLEM'])\
  .setPreservePosition(False)

c2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") 

bert_embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")\
  .setInputCols(["ner_chunk_doc"])\
  .setOutputCol("bert_embeddings")

icd10pcs_resolution = SentenceEntityResolverModel.pretrained("biobertresolve_icd10pcs", "en", "clinical/models") \
  .setInputCols(["ner_chunk", "bert_embeddings"]) \
  .setOutputCol("icd10pcs_code")
  
icd10cm_resolution = SentenceEntityResolverModel.pretrained("biobertresolve_icd10cm", "en", "clinical/models") \
  .setInputCols(["ner_chunk", "bert_embeddings"]) \
  .setOutputCol("icd10cm_code")

bert_pipeline_icd10 = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    c2doc,
    bert_embeddings,
    icd10pcs_resolution,
    icd10cm_resolution
  ])

bert_model_icd10 = bert_pipeline_icd10.fit(data_ner)

bert_light_pipeline_icd10 = LightPipeline(bert_model_icd10)

In [0]:
# Reuse the clinical note defined at the beginning of the notebook
text = clinical_note

light_result = bert_light_pipeline_icd10.annotate(text)

light_result.keys()

In [0]:
light_result['icd10pcs_code']

In [0]:
light_result['ner_chunk']

In [0]:

df = pd.DataFrame(list(zip(light_result['ner_chunk'], light_result['icd10cm_code'])),
                  columns = ['Problem','icd10cm_code'])

df

| | Problem | icd10cm_code |
|---|---|---|
| 0 | gestational diabetes mellitus | O24410 |
| 1 | type two diabetes mellitus | E119 |
| 2 | T2DM), | E119 |
| 3 | HTG-induced pancreatitis | K8522 |
| 4 | an acute hepatitis | B172 |
| 5 | obesity | E6609 |
| 6 | a body mass index | Z6845 |
| 7 | BMI) of 33.5 kg/m2 | Z681 |
| 8 | polyuria | R358 |
| 9 | polydipsia | H93233 |


In [0]:

df = get_icd10_codes (bert_light_pipeline_icd10, text, 'icd10cm_code')

df

| | chunks | begin | end | icd10cm_code | resolutions |
|---|---|---|---|---|---|
| 0 | gestational diabetes mellitus | 39 | 67 | O24410 | Gestational diabetes mellitus in pregnancy, di... |
| 1 | type two diabetes mellitus | 128 | 153 | E119 | Type 2 diabetes mellitus without complications... |
| 2 | T2DM), | 156 | 161 | E119 | Type 2 diabetes mellitus without complications... |
| 3 | HTG-induced pancreatitis | 184 | 207 | K8522 | Alcohol induced acute pancreatitis with infect... |
| 4 | an acute hepatitis | 260 | 277 | B172 | Acute hepatitis E:::Acute hepatitis B without ... |
| 5 | obesity | 284 | 290 | E6609 | Other obesity due to excess calories:::CR(E)ST... |
| 6 | a body mass index | 297 | 313 | Z6845 | Body mass index (BMI) 70 or greater, adult:::B... |
| 7 | BMI) of 33.5 kg/m2 | 316 | 333 | Z681 | Body mass index (BMI) 19.9 or less, adult:::Bo... |
| 8 | polyuria | 373 | 380 | R358 | Other polyuria:::Dysuria:::Cystinuria:::Person... |
| 9 | polydipsia | 383 | 392 | H93233 | Hyperacusis, bilateral:::Sleepwalking [somnamb... |


In [0]:
text = 'He has a starvation ketosis but nothing found for significant for dry oral mucosa'

# ICD10 CM

df = get_icd10_codes (bert_light_pipeline_icd10, text, 'icd10cm_code')

df

| | chunks | begin | end | icd10cm_code | resolutions |
|---|---|---|---|---|---|
| 0 | a starvation ketosis | 7 | 26 | E873 | Alkalosis:::Starvation, subsequent encounter::... |
| 1 | dry oral mucosa | 66 | 80 | R0982 | Postnasal drip:::Nasal congestion:::Plicated t... |


# RxNorm Resolver

`setAlternatives` : number of results to return in the metadata after sorting by last distance calculated

`setNeighbours` : number of neighbours to consider in the KNN query to calculate WMD

`setEnableLevenshtein`: whether or not to use Levenshtein character distance.

`setDistanceWeights` : `[WMD, TFIDF, Jaccard, SorensenDice, JaroWinkler, Levenshtein]`
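
To give a feel for what weighting several string distances means, the sketch below implements two of the listed measures (Levenshtein and Jaccard) in plain Python and blends them with a weighted average. The combination rule, the normalization, the helper names and the weights are assumptions for illustration. This notebook does not state how the resolver normalizes or combines its distances internally.

```python
def levenshtein(a, b):
    """Character edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def jaccard_distance(a, b):
    """1 - |intersection| / |union| over lower-cased word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return 1.0 - len(tokens_a & tokens_b) / len(union) if union else 0.0

def weighted_distance(distances, weights):
    """Weighted average of named distances (combination rule is assumed)."""
    total_weight = sum(weights[name] for name in distances)
    return sum(distances[name] * weights[name] for name in distances) / total_weight

mention, candidate = "insulin glargine", "insulin lispro"
distances = {
    "Jaccard": jaccard_distance(mention, candidate),
    # Normalized to [0, 1] by the longer string length (an assumption).
    "Levenshtein": levenshtein(mention, candidate) / max(len(mention), len(candidate)),
}
weights = {"Jaccard": 1.0, "Levenshtein": 9.0}  # hypothetical weights
score = weighted_distance(distances, weights)
print(distances, score)
```

Setting a weight to 0, as the `[3,11,0,0,0,9]` vector used below does for three of the measures, removes that measure from the blend.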

In [0]:
# Tokenizer splits each sentence into tokens

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("raw_token")

# StopWordsCleaner removes stop words from the raw tokens

stopwords = StopWordsCleaner()\
  .setInputCols(["raw_token"])\
  .setOutputCol("token")

ner_converter = NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("greedy_chunk")\
  .setWhiteList(['TREATMENT'])

chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("greedy_chunk", "embeddings")\
    .setOutputCol("chunk_embeddings")

rxnorm_resolver1 = ChunkEntityResolverModel\
    .pretrained('chunkresolve_rxnorm_sbd_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\
    .setInputCols('token', 'chunk_embeddings')\
    .setOutputCol('rxnorm_resolution')\
    .setPoolingStrategy("MAX")

pipeline_rx = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    stopwords,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunk_embeddings,
    rxnorm_resolver1
  ])

model_rxnorm = pipeline_rx.fit(data_ner)


In [0]:
text

In [0]:
output = model_rxnorm.transform(data_ner)

output.show()

In [0]:
output.select(F.explode(F.arrays_zip("greedy_chunk.result","greedy_chunk.metadata","rxnorm_resolution.result","rxnorm_resolution.metadata")).alias("rxnorm_result")) \
    .select(F.expr("rxnorm_result['0']").alias("chunk"),
            F.expr("rxnorm_result['1'].entity").alias("entity"),
            F.expr("rxnorm_result['3'].all_k_resolutions").alias("target_text"),
            F.expr("rxnorm_result['2']").alias("code"),
            F.expr("rxnorm_result['3'].confidence").alias('confidence')).show(truncate = 100)

In [0]:
text = 'The patient was prescribed 1 prozac 60mg (oral capsules) for 5 days after meals. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day.'
text

In [0]:
data_ner = spark.createDataFrame([[text]]).toDF("text")

In [0]:

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("raw_token")

stopwords = StopWordsCleaner()\
  .setInputCols(["raw_token"])\
  .setOutputCol("token")

posology_ner = NerDLModel.pretrained("ner_drugs_large", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("ner_chunk", "embeddings")\
    .setOutputCol("chunk_embeddings")

rxnorm_resolver = ChunkEntityResolverModel\
    .pretrained('chunkresolve_rxnorm_sbd_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\
    .setInputCols('token', 'chunk_embeddings')\
    .setOutputCol('rxnorm_resolution')\
    .setPoolingStrategy("MAX")

posology_rx = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    stopwords,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_embeddings,
    rxnorm_resolver
  ])

model_rxnorm = posology_rx.fit(data_ner)

output = model_rxnorm.transform(data_ner)

output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","rxnorm_resolution.result","rxnorm_resolution.metadata")).alias("rxnorm_result")) \
    .select(F.expr("rxnorm_result['0']").alias("chunk"),
            F.expr("rxnorm_result['1'].entity").alias("entity"),
            F.expr("rxnorm_result['3'].all_k_resolutions").alias("target_text"),
            F.expr("rxnorm_result['2']").alias("code"),
            F.expr("rxnorm_result['3'].confidence").alias("confidence")).show(truncate = 100)
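
Judging by its name, `setEnableLevenshtein(True)` adds an edit-distance signal to the resolver. How the library weights and combines that signal with the embedding distances is not shown in this notebook, so the sketch below only illustrates what a Levenshtein distance computes. It is a standard dynamic-programming implementation; the function name is ours and is not part of Spark NLP.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("metformin", "metformin"))
print(levenshtein("prozac", "prozak"))
print(levenshtein("insulin lispro", "insulin glargine"))
```

A string distance like this can separate candidates whose embeddings are close but whose surface forms differ, which is one plausible reason to enable it for drug names.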

## Pretrained RxNorm Resolver

In [0]:
output = model.transform(data_ner)

In [0]:
data_ner.show(1)

In [0]:
output.columns

In [0]:

posology_ner = NerDLModel.pretrained("ner_drugs_large", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("ner_chunk", "embeddings")\
    .setOutputCol("chunk_embeddings")

rxnorm_resolver1 = ChunkEntityResolverModel\
    .pretrained('chunkresolve_rxnorm_sbd_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,11,0,0,0,9])\
    .setInputCols('token', 'chunk_embeddings')\
    .setOutputCol('rxnorm_resolution')\
    .setPoolingStrategy("MAX")


posology_rx_pretrained = Pipeline(
    stages = [
    posology_ner,
    ner_converter,
    chunk_embeddings,
    rxnorm_resolver1
  ])

model_rxnorm_pretrained = posology_rx_pretrained.fit(output)


In [0]:
model_rxnorm_pretrained.write().overwrite().save('dbfs:/databricks/driver/saved_model_rxnorm_pretrained')


In [0]:

loaded_model_rxnorm_pretrained = PipelineModel.load('dbfs:/databricks/driver/saved_model_rxnorm_pretrained')

In [0]:
loaded_model_rxnorm_pretrained.stages

In [0]:
data_ner.show(1)

In [0]:
posology_rx_pretrained = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    stopwords,
    word_embeddings,
    loaded_model_rxnorm_pretrained
  ])

posology_rxnorm_pretrained = posology_rx_pretrained.fit(data_ner)

pretrained_output = posology_rxnorm_pretrained.transform(data_ner)

pretrained_output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","rxnorm_resolution.result","rxnorm_resolution.metadata")).alias("rxnorm_result")) \
    .select(F.expr("rxnorm_result['0']").alias("chunk"),
            F.expr("rxnorm_result['1'].entity").alias("entity"),
            F.expr("rxnorm_result['3'].all_k_resolutions").alias("target_text"),
            F.expr("rxnorm_result['2']").alias("code"),
            F.expr("rxnorm_result['3'].confidence").alias("confidence")).show(truncate = 100)

In [0]:
pretrained_output.show(1)

In [0]:
from sparknlp.pretrained import ResourceDownloader
loaded_rxnorm_pretrained = ResourceDownloader.downloadPipeline("ppl_posology_rxnorm","en","clinical/models")

loaded_rxnorm_pretrained.stages

In [0]:
posology_rx_pretrained = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    stopwords,
    word_embeddings,
    loaded_rxnorm_pretrained
  ])

posology_rxnorm_pretrained = posology_rx_pretrained.fit(data_ner)

pretrained_output = posology_rxnorm_pretrained.transform(data_ner)

pretrained_output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","rxnorm_resolution.result","rxnorm_resolution.metadata")).alias("rxnorm_result")) \
    .select(F.expr("rxnorm_result['0']").alias("chunk"),
            F.expr("rxnorm_result['1'].entity").alias("entity"),
            F.expr("rxnorm_result['3'].all_k_resolutions").alias("target_text"),
            F.expr("rxnorm_result['2']").alias("code"),
            F.expr("rxnorm_result['3'].confidence").alias("confidence")).show(truncate = 100)

# RxNorm with SentenceEntityResolver (BioBert) (after Spark NLP 2.7)

In [0]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

# The pretrained "embeddings_clinical" WordEmbeddingsModel is a 1.7 GB download

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

posology_ner = NerDLModel.pretrained("ner_drugs_large", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

c2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") 

bert_embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")\
  .setInputCols(["ner_chunk_doc"])\
  .setOutputCol("bert_embeddings")

rxnorm_resolution = SentenceEntityResolverModel.pretrained("biobertresolve_rxnorm_bdcd", "en", "clinical/models") \
  .setInputCols(["ner_chunk", "bert_embeddings"]) \
  .setOutputCol("rxnorm_code")

bert_pipeline_rxnorm = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    c2doc,
    bert_embeddings,
    rxnorm_resolution
  ])

bert_model_rxnorm = bert_pipeline_rxnorm.fit(data_ner)

bert_light_pipeline_rxnorm = LightPipeline(bert_model_rxnorm)

In [0]:

bert_rxnorm_output = bert_model_rxnorm.transform(data_ner)

bert_rxnorm_output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","rxnorm_code.result","rxnorm_code.metadata")).alias("rxnorm_result")) \
    .select(F.expr("rxnorm_result['0']").alias("chunk"),
            F.expr("rxnorm_result['1'].entity").alias("entity"),
            F.expr("rxnorm_result['3'].all_k_resolutions").alias("target_text"),
            F.expr("rxnorm_result['2']").alias("code"),
            F.expr("rxnorm_result['3'].confidence").alias("confidence")).show(truncate = 100)

In [0]:
text = (
    'A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa; significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, bicarbonate 18 mmol/l, anion gap 20, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, glycated hemoglobin (HbA1c) 10%, and venous pH 7.27. Serum lipase was normal at 43 U/L. Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia. The patient was initially admitted for starvation ketosis, as she reported poor oral intake for three days prior to admission. However, serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL, the anion gap was still elevated at 21, serum bicarbonate was 16 mmol/L, triglyceride level peaked at 2050 mg/dL, and lipase was 52 U/L. The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again. '
    'The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL, within 24 hours. Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use. The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely. She had close follow-up with endocrinology post discharge.'
)

light_result = bert_light_pipeline_rxnorm.annotate(text)


df = pd.DataFrame(list(zip(light_result['ner_chunk'], light_result['rxnorm_code'])),
                  columns = ['Drug','rxnorm_code'])

df


| | Drug | rxnorm_code |
|---|---|---|
| 0 | amoxicillin | 308191 |
| 1 | metformin | 311572 |
| 2 | glipizide | 310488 |
| 3 | dapagliflozin | 1486977 |
| 4 | atorvastatin | 312962 |
| 5 | gemfibrozil | 201520 |
| 6 | dapagliflozin | 1486977 |
| 7 | Serum acetone | 618977 |
| 8 | insulin drip | 311083 |
| 9 | SGLT2 inhibitor | 104625 |
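
The table above is built by zipping two lists taken from the dictionary that `LightPipeline.annotate` returns (one list of strings per output column). A small sketch of that step follows, using a hand-written dictionary whose values are copied from the first three rows of the table. The explicit length check is our addition, since `zip` silently truncates to the shorter list.

```python
# Hand-written stand-in for the annotate() result; values are copied from the
# first three rows of the table above.
light_result = {
    "ner_chunk": ["amoxicillin", "metformin", "glipizide"],
    "rxnorm_code": ["308191", "311572", "310488"],
}

chunks = light_result["ner_chunk"]
codes = light_result["rxnorm_code"]

# zip() pairs items by position, so both lists must describe the same chunks.
if len(chunks) != len(codes):
    raise ValueError("chunk and code lists are misaligned")

pairs = list(zip(chunks, codes))
drug_to_code = dict(pairs)

print(pairs)
print(drug_to_code["metformin"])
```

Note that a dictionary keyed by drug name collapses repeated mentions (such as the two `dapagliflozin` rows), so the list of pairs is the safer structure when mention order matters.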


# ICD10 + RxNorm with multiple NERs

In [0]:
notes = [
'Pentamidine 300 mg IV q . 36 hours , Pentamidine nasal wash 60 mg per 6 ml of sterile water q.d . , voriconazole 200 mg p.o . b.i.d . , acyclovir 400 mg p.o . b.i.d . , cyclosporine 50 mg p.o . b.i.d . , prednisone 60 mg p.o . q.d . , GCSF 480 mcg IV q.d . , Epogen 40,000 units subcu q . week , Protonix 40 mg q.d . , Simethicone 80 mg p.o . q . 8 , nitroglycerin paste 1 " ; q . 4 h . p.r.n . , flunisolide nasal inhaler , 2 puffs q . 8 , OxyCodone 10-15 mg p.o . q . 6 p.r.n . , Sudafed 30 mg q . 6 p.o . p.r.n . , Fluconazole 2% cream b.i.d . to erythematous skin lesions , Ditropan 5 mg p.o . b.i.d . , Tylenol 650 mg p.o . q . 4 h . p.r.n . , Ambien 5-10 mg p.o . q . h.s . p.r.n . , Neurontin 100 mg q . a.m . , 200 mg q . p.m . , Aquaphor cream b.i.d . p.r.n . , Lotrimin 1% cream b.i.d . to feet , Dulcolax 5-10 mg p.o . q.d . p.r.n . , Phoslo 667 mg p.o . t.i.d . , Peridex 0.12% , 15 ml p.o . b.i.d . mouthwash , Benadryl 25-50 mg q . 4-6 h . p.r.n . pruritus , Sarna cream q.d . p.r.n . pruritus , Nystatin 5 ml p.o . q.i.d . swish and !',
'Albuterol nebulizers 2.5 mg q.4h . and Atrovent nebulizers 0.5 mg q.4h . , please alternate albuterol and Atrovent ; Rocaltrol 0.25 mcg per NG tube q.d .; calcium carbonate 1250 mg per NG tube q.i.d .; vitamin B12 1000 mcg IM q . month , next dose is due Nov 18 ; diltiazem 60 mg per NG tube t.i.d .; ferrous sulfate 300 mg per NG t.i.d .; Haldol 5 mg IV q.h.s .; hydralazine 10 mg IV q.6h . p.r.n . hypertension ; lisinopril 10 mg per NG tube q.d .; Ativan 1 mg per NG tube q.h.s .; Lopressor 25 mg per NG tube t.i.d .; Zantac 150 mg per NG tube b.i.d .; multivitamin 10 ml per NG tube q.d .; Macrodantin 100 mg per NG tube q.i.d . x 10 days beginning on 11/3/00 .',
'Tylenol 650 mg p.o . q . 4-6h p.r.n . headache or pain ; acyclovir 400 mg p.o . t.i.d .; acyclovir topical t.i.d . to be applied to lesion on corner of mouth ; Peridex 15 ml p.o . b.i.d .; Mycelex 1 troche p.o . t.i.d .; g-csf 404 mcg subcu q.d .; folic acid 1 mg p.o . q.d .; lorazepam 1-2 mg p.o . q . 4-6h p.r.n . nausea and vomiting ; Miracle Cream topical q.d . p.r.n . perianal irritation ; Eucerin Cream topical b.i.d .; Zantac 150 mg p.o . b.i.d .; Restoril 15-30 mg p.o . q . h.s . p.r.n . insomnia ; multivitamin 1 tablet p.o . q.d .; viscous lidocaine 15 ml p.o . q . 3h can be applied to corner of mouth or lips p.r.n . pain control .',
'The patient\'s incisions sternal and right leg were clean and healing well , normal sinus rhythm at 70-80 , with blood pressure 98-110/60 and patient was doing well , recovering , ambulating , tolerating regular diet and last hematocrit prior to discharge was 39% with a BUN and creatinine of 15 and 1.0 , prothrombin time level of 13.8 , chest X-ray prior to discharge showed small bilateral effusions with mild cardiomegaly and subsegmental atelectasis bibasilar and electrocardiogram showed normal sinus rhythm with left atrial enlargement and no acute ischemic changes on electrocardiogram .',
'This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .',
'O2 95% on 3L NC mixed Quinn 82% genrl : in nad , resting comfortably heent : perrla ( 4->3 mm ) bilaterally , blind in right visual field , eomi , dry mm , ? thrush neck : no bruits cv : rrr , no m/r/g , faint s1/s2 pulm : cta bilaterally abd : midline scar ( from urostomy ) , nabs , soft , appears distended but patient denies , ostomy RLQ c/d/i , NT to palpation back : right flank urostomy tube , c/d/i , nt to palpation extr : no Gardner neuro : a , ox3 , wiggles toes bilaterally , unable to lift LE , 06-12 grip bilaterally w/ UE , decrease sensation to soft touch in left',
'Is notable for an inferior myocardial infarction , restrictive and obstructive lung disease with an FEV1 of . 9 and FVC of 1.34 and a moderate at best response to bronchodilators , and a negative sestamibi scan in May , 1999 apart from a severe fixed inferolateral defect , systolic dysfunction with recent echocardiography revealing an LVID of 62 mm . and ejection fraction of 28 percent , moderate mitral regurgitation and mild-to-moderate aortic stenosis with a peak gradient of 33 and a mean gradient of 19 and a valve area of 1.4 cm . squared .',
'This is a 47 - year-old male with a past medical history of type 2 diabetes , high cholesterol , hypertension , and coronary artery disease , status post percutaneous transluminal coronary angioplasty times two , who presented with acute coronary syndrome refractory to medical treatment and TNK , now status post Angio-Jet percutaneous transluminal coronary angioplasty and stent of proximal left anterior descending artery and percutaneous transluminal coronary angioplasty of first diagonal with intra-aortic balloon pump placement .',
'Clinical progression of skin and sinus infection on maximal antimicrobial therapy continued , with emergence on November 20 of a new right-sided ptosis in association with a left homonymous hemianopsia , and fleeting confusion while febrile , prompting head MRI which revealed a large 5 x 2 x 4.3 cm region in the right occipital lobe of hemorrhage and edema , with dural and , likely , leptomeningeal enhancement in association with small foci in the right cerebellum and pons , concerning for early lesions of similar type .',
'The patient had an echocardiogram on day two of admission , which revealed a mildly dilated left atrium , mild symmetric LVH , normal LV cavity size , mild region LV systolic dysfunction , arresting regional wall motion abnormality including focal apical hypokinesis , a normal right ventricular chamber size and free wall motion , a moderately dilated aortic root , a mildly dilated ascending aorta , normal aortic valve leaflet , normal mitral valve leaflet and no pericardial effusions .',
'The patient is a 65-year-old man with refractory CLL , status post non-myeloblative stem cell transplant approximately nine months prior to admission , and status post prolonged recent Retelk County Medical Center stay for Acanthamoeba infection of skin and sinuses , complicated by ARS due to medication toxicity , as well as GVHD and recent CMV infection , readmitted for new fever , increasing creatinine , hepatomegaly and fluid surge spacing , in the setting of hyponatremia .',
'Tylenol 650 mg p.o . q.4h . p.r.n . , Benadryl 25 mg p.o . q.h.s . p.r.n . , Colace 100 mg p.o . q.i.d . , Nortriptyline 25 mg p.o . q.h.s . , Simvastatin 10 mg p.o . q.h.s . , Metamucil one packet p.o . b.i.d . p.r.n . , Neurontin 300 mg p.o . t.i.d . , Levsinex 0.375 mg p.o . q.12h . , Lisinopril / hydrochlorothiazide 20/25 mg p.o . q.d . , hydrocortisone topical ointment to affected areas , MS Contin 30 mg p.o . b.i.d . , MSIR 15 to 30 mg p.o . q.4h . p.r.n . pain .',
'Aspirin 325 q.d . ; albuterol nebs 2.5 mg q . 4h ; Colace 100 mg b.i.d . ; heparin 5,000 units subcu b.i.d . ; Synthroid 200 mcg q.d . ; Ocean Spray 2 sprays q . i.d . ; simvastatin 10 mg q . h.s . ; Flovent 220 mcg 2 puffs b.i.d . ; Zantac 150 b.i.d . ; nystatin ointment to the gluteal fold b.i.d . ; Lisinopril 20 mg q.d . ; Mestinon controlled release 180 q . h.s . ; Mestinon 30 mg q . 4h while awake ; prednisone 60 mg p.o . q . IM ; Atrovent nebs 0.5 mg q . i.d .',
'An echocardiogram was obtained on 4-26 which showed concentric left ventricular hypertrophy with normal _____ left ventricular function , severe right ventricular dilatation with septal hypokinesis and flattening with a question of right ventricular apical clot raised with mild aortic stenosis , severe tricuspid regurgitation and increased pulmonary artery pressure of approximately 70 millimeters , consistent with fairly severe pulmonary hypertension .',
'1 ) CV ( R ) finished amio IV load then started on po , agressive lytes ; although interrogation showed >100 episodes of VT ( as / x ) , pt prefers med therapy as opposed to ablation ( I ) enzymes mildly elevated but not actively ischemic ; lipids , ASA , statin , BB ; Adenosine thal 1/4 and echo 1/4 to look for signs of ischemia as active cause for VT ( P ) JVP at angle of jaw 1/4 -- > giving 20 Lasix ; dig level 1/4 1.3 -- > 1/2 dose as on Amio',
'sodium 141 , potassium 3.5 , chloride 107 , bicarbonates 23.8 , BUN 23 , creatinine 1.1 , glucose 165 , PO2 377 , PCO2 32 , PH 7.50 , asomus 298 , toxic screen negative , white blood cell count 11.1 , hematocrit 39.6 , platelet count 137 , prothrombin time 25.2 , INR 4.3 , partial thromboplastin time 34.7 , urinalysis 1+ albumin , 0-5 high link caths , cervical spine negative , pelvis negative , lumbar spine ; negative , thoracic spine negative .',
]

In [0]:
from IPython.display import display, HTML

html_output=""
for i, d in enumerate(notes):
    html_output += f'Note {i}:'
    html_output +='<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px">'
    html_output += d
    html_output += '</div><br/>'

display(HTML(html_output))

In [0]:
data = spark.createDataFrame([(i,n.lower()) for i,n in enumerate(notes)]).toDF('doc_id', 'text')

data.show(truncate=50)

Let's build a Spark NLP pipeline with the following stages:

`DocumentAssembler`: Entry annotator for our pipelines; it creates the data structure for the Annotation Framework

`SentenceDetector`: Annotator to pragmatically separate complete sentences inside each document

`Tokenizer`: Annotator to separate sentences into tokens (generally words)

`StopWordsCleaner`: Annotator to remove words defined as StopWords in SparkML

`WordEmbeddings`: Vectorization of word tokens, in this case using word embeddings trained from PubMed, ICD10 and other clinical resources.

`ChunkEmbeddings`: Aggregates the WordEmbeddings for each NER Chunk

`JSL NER + NerConverter`: These annotators return chunks related to jsl_ner (generic NER)

`Drug NER + NerConverter`: These annotators return chunks related to drugs

`ChunkEntityResolver`: Annotator that searches for the k nearest neighbours (KNN) of each chunk, in this case using models trained on ICD-10-CM and RxNorm.
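
To make the `ChunkEmbeddings` and `ChunkEntityResolver` stages more concrete, here is a toy sketch of the idea: token vectors are pooled into one chunk vector, which is then compared against reference vectors to find the nearest codes. Everything in it is an assumption made for illustration: the 3-dimensional vectors, the code names, average pooling, and plain Euclidean distance. The real resolver works on much larger embeddings and combines several weighted distances.

```python
import math

# Toy 3-dimensional "word embeddings" (hypothetical values).
token_vectors = {
    "chest": [0.9, 0.1, 0.0],
    "pain": [0.7, 0.3, 0.0],
    "fever": [0.0, 0.2, 0.9],
}

# Hypothetical reference index: code -> embedding of the code's description.
reference_index = {
    "CODE_CHEST_PAIN": [0.8, 0.2, 0.0],
    "CODE_FEVER": [0.0, 0.2, 0.9],
    "CODE_OTHER": [0.3, 0.9, 0.1],
}


def average_pool(tokens):
    """Average the token vectors of a chunk into a single chunk vector."""
    vectors = [token_vectors[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def k_nearest(chunk_vector, k=2):
    """Return the k closest (code, distance) pairs, nearest first."""
    scored = [(code, euclidean(chunk_vector, vec))
              for code, vec in reference_index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


chunk_vector = average_pool(["chest", "pain"])
neighbours = k_nearest(chunk_vector)
print(neighbours)
```

The first neighbour plays the role of the resolved code, and the remaining ones correspond to the alternative codes that the resolver reports in its metadata.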

In [0]:
# Annotators responsible for the generic clinical (JSL) and drug entity recognition tasks

jslNer = NerDLModel.pretrained('ner_jsl', 'en', "clinical/models")\
    .setInputCols('sentence', 'token', 'embeddings')\
    .setOutputCol('ner_jsl')

drugNer = NerDLModel.pretrained('ner_drugs', 'en', "clinical/models")\
    .setInputCols('sentence', 'token', 'embeddings')\
    .setOutputCol('ner_drug')

In [0]:

# Converter annotators transform IOB tags into full chunks (sequences of tokens) tagged with `entity` metadata

jslConverter = NerConverter()\
    .setInputCols('sentence', 'token', 'ner_jsl')\
    .setOutputCol('chunk_jsl')\
    .setWhiteList(["Diagnosis"])

drugConverter = NerConverter()\
    .setInputCols('sentence', 'token', 'ner_drug')\
    .setOutputCol('chunk_drug')
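
The converters above turn token-level IOB tags into chunks and, with `setWhiteList`, keep only selected entity types. A simplified, Spark-free sketch of that conversion is shown below. The function name, the example sentence, and the `Drug` label are hypothetical, and the real `NerConverter` handles more cases than this.

```python
def iob_to_chunks(tokens, tags, whitelist=None):
    """Group B-/I- tagged tokens into (text, entity) chunks and optionally
    keep only the entities named in whitelist."""
    chunks = []
    current_tokens, current_entity = [], None
    # A trailing sentinel "O" tag flushes the last open chunk.
    for token, tag in zip(list(tokens) + [""], list(tags) + ["O"]):
        prefix, _, entity = tag.partition("-")
        if prefix == "I" and entity == current_entity:
            current_tokens.append(token)  # continue the open chunk
            continue
        if current_tokens:  # close the chunk that was open so far
            chunks.append((" ".join(current_tokens), current_entity))
        if prefix in ("B", "I"):  # start a new chunk
            current_tokens, current_entity = [token], entity
        else:  # "O": outside any chunk
            current_tokens, current_entity = [], None
    if whitelist is not None:
        chunks = [chunk for chunk in chunks if chunk[1] in whitelist]
    return chunks


tokens = ["history", "of", "type", "2", "diabetes", "on", "metformin"]
tags = ["O", "O", "B-Diagnosis", "I-Diagnosis", "I-Diagnosis", "O", "B-Drug"]

all_chunks = iob_to_chunks(tokens, tags)
diagnoses_only = iob_to_chunks(tokens, tags, whitelist=["Diagnosis"])
print(all_chunks)
print(diagnoses_only)
```

Filtering at the converter keeps unwanted entity types out of the downstream embedding and resolver stages, so they are never sent for code lookup.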

In [0]:

#ChunkEmbeddings annotators aggregate embeddings for each token in the chunk

jslChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols('chunk_jsl', 'embeddings')\
  .setOutputCol('chunk_embs_jsl')

drugChunkEmbeddings = ChunkEmbeddings()\
  .setInputCols('chunk_drug', 'embeddings')\
  .setOutputCol('chunk_embs_drug')

In [0]:
# Entity Resolution Pretrained Models

icd10cmResolver2 = ChunkEntityResolverModel.pretrained('chunkresolve_icd10cm_diseases_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,3,2,0,0,7])\
    .setInputCols('token', 'chunk_embs_jsl')\
    .setOutputCol('icd10cm_resolution')

rxnormResolver2 = ChunkEntityResolverModel\
    .pretrained('chunkresolve_rxnorm_scd_clinical', 'en', "clinical/models")\
    .setEnableLevenshtein(True)\
    .setNeighbours(200).setAlternatives(5).setDistanceWeights([3,3,2,0,0,7])\
    .setInputCols('token', 'chunk_embs_drug')\
    .setOutputCol('rxnorm_resolution')

In [0]:
# Tokenizer splits words in a relevant format for NLP

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("raw_token")

# StopWordsCleaner removes stop words from the token stream

stopwords = StopWordsCleaner()\
  .setInputCols(["raw_token"])\
  .setOutputCol("token")


pipelineFull = Pipeline().setStages([
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    stopwords, 
    word_embeddings, 
    jslNer,
    drugNer,
    jslConverter,
    drugConverter,
    jslChunkEmbeddings, 
    drugChunkEmbeddings,
    icd10cmResolver2,
    rxnormResolver2
])

In [0]:
# Fit and transform; the output is persisted to disk in the next cell to keep DAG size and resource usage low (word embeddings are resource-intensive)
pipelineModelFull = pipelineFull.fit(data)

output = pipelineModelFull.transform(data)



In [0]:
output.write.mode("overwrite").save("temp")

output = spark.read.load("temp")

In [0]:
%%time
output.show()

In [0]:
# Let's see what would have happened if we hadn't persisted the output to disk.
output = pipelineModelFull.transform(data)

In [0]:
#%%time
output.show()
# 1.8 s (persisted) vs 37.7 s (recomputed) for the first 20 rows (about 20x faster)
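
The gap between the two timings follows from lazy evaluation: a Spark DataFrame records the transformations and re-runs them on every action unless the result has been materialized. The toy class below is a Spark-free analogy of that behaviour, not Spark code. `LazyPlan`, `expensive_stage`, and the call counter are hypothetical names invented for the illustration.

```python
class LazyPlan:
    """Toy stand-in for a lazily evaluated DataFrame: it stores the steps
    and replays all of them on every action."""

    def __init__(self, source, steps=()):
        self.source = list(source)
        self.steps = tuple(steps)

    def map(self, fn):
        # Record the step without running it.
        return LazyPlan(self.source, self.steps + (fn,))

    def collect(self):
        # An "action": replay every recorded step from the source.
        rows = list(self.source)
        for fn in self.steps:
            rows = [fn(r) for r in rows]
        return rows

    def materialize(self):
        # Run the plan once and return a plan with no pending steps,
        # similar in spirit to writing the output and reading it back.
        return LazyPlan(self.collect())


calls = {"count": 0}


def expensive_stage(x):
    calls["count"] += 1  # count how often the costly stage really runs
    return x * 2


plan = LazyPlan([1, 2, 3]).map(expensive_stage)

plan.collect()
plan.collect()
recomputed_calls = calls["count"]  # the stage ran again for each action

calls["count"] = 0
saved = plan.materialize()  # the stage runs once here
saved.collect()
saved.collect()
materialized_calls = calls["count"]

print(recomputed_calls, materialized_calls)
```

In the notebook the "expensive stage" is the whole NLP pipeline, which is why reading the saved output back is so much cheaper than transforming the data again.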

In [0]:
def quick_metadata_analysis(df, doc_field, chunk_field, code_fields):
    """Return one row per chunk with its coordinates, entity and, per resolver, confidence and top-k options."""
    # Zip the chunk fields (indices 0-3) with each resolver's metadata (indices 4 and up), one row per chunk
    code_res_meta = ", ".join(f"{cf}.metadata" for cf in code_fields)
    expression = (f"explode(arrays_zip({chunk_field}.begin, {chunk_field}.end, "
                  f"{chunk_field}.result, {chunk_field}.metadata, {code_res_meta})) as a")
    # For each resolver: its confidence, plus (code, resolution) pairs split on the ':::' separator
    top_n_rest_args = []
    for i, cf in enumerate(code_fields):
        prefix = cf.split('_')[0]
        top_n_rest_args.append(f"float(a['{i+4}'].confidence) as {prefix}_conf")
        top_n_rest_args.append(
            f"arrays_zip(split(a['{i+4}'].all_k_results,':::'),split(a['{i+4}'].all_k_resolutions,':::')) as {prefix}_opts")
    # Order by document and chunk position (doc_field replaces the previously hard-coded 'doc_id')
    return df.selectExpr(doc_field, expression) \
        .orderBy(doc_field, F.expr("a['0']"), F.expr("a['1']"))\
        .selectExpr(f"concat_ws('::',{doc_field},a['0'],a['1']) as coords",
                    "a['2'] as chunk", "a['3'].entity as entity", *top_n_rest_args)
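
The helper above depends on the resolver metadata storing its top-k candidates as `:::`-separated strings in `all_k_results` (codes) and `all_k_resolutions` (descriptions). A plain-Python sketch of the same parsing for a single metadata map follows. The function name is ours, and the example values are copied from the `insomnia` row of the ICD-10-CM results in this notebook, shortened to three candidates.

```python
def parse_resolver_metadata(metadata):
    """Turn one resolver metadata map into (confidence, [(code, resolution), ...])."""
    codes = metadata["all_k_results"].split(":::")
    resolutions = metadata["all_k_resolutions"].split(":::")
    return float(metadata["confidence"]), list(zip(codes, resolutions))


# Hand-written metadata map; metadata values are stored as strings.
metadata = {
    "confidence": "0.905",
    "all_k_results": "G4700:::G4709:::F5102",
    "all_k_resolutions": "Insomnia, unspecified:::Other insomnia:::Adjustment insomnia",
}

confidence, options = parse_resolver_metadata(metadata)
print(confidence)
for code, resolution in options:
    print(code, "->", resolution)
```

Splitting on `:::` rather than on commas matters here, because many descriptions (for example `Insomnia, unspecified`) contain commas themselves.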

In [0]:
icd10cm_analysis = quick_metadata_analysis(output, 'doc_id', 'chunk_jsl',['icd10cm_resolution']).toPandas()

In [0]:
rxnorm_analysis = \
quick_metadata_analysis(output, 'doc_id', 'chunk_drug',['rxnorm_resolution']).toPandas()

In [0]:
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_rows', 500)

In [0]:
icd10cm_analysis[icd10cm_analysis.icd10cm_conf>0.4]

| | coords | chunk | entity | icd10cm_conf | icd10cm_opts |
|---|---|---|---|---|---|
| 1 | 2::499::506 | insomnia | Diagnosis | 0.905 | [(G4700, Insomnia, unspecified), (G4709, Other insomnia), (F5102, Adjustment insomnia), (F5101, Primary insomnia), (F5109, Other insomnia not due to a substance or known physiological condition)] |
| 4 | 4::120::128 | gastritis | Diagnosis | 0.468 | [(K2970, Gastritis, unspecified, without bleeding), (B9681, Helicobacter pylori [H. pylori] as the cause of diseases classified elsewhere), (K2900, Acute gastritis without bleeding), (A084, Viral intestinal infection, unspecified), (K2960, Other ... |
| 5 | 6::67::103 | obstructive lung disease with an fev1 | Diagnosis | 0.4564 | [(J670, Farmer's lung), (J984, Other disorders of lung), (J449, Chronic obstructive pulmonary disease, unspecified), (J849, Interstitial pulmonary disease, unspecified), (J440, Chronic obstructive pulmonary disease with acute lower respiratory in... |
| 7 | 6::274::293 | systolic dysfunction | Diagnosis | 0.8329 | [(I519, Heart disease, unspecified), (I5040, Unspecified combined systolic (congestive) and diastolic (congestive) heart failure), (I5020, Unspecified systolic (congestive) heart failure), (N522, Drug-induced erectile dysfunction), (F5221, Male e... |
| 9 | 10::223::264 | acanthamoeba infection of skin and sinuses | Diagnosis | 0.4519 | [(L089, Local infection of the skin and subcutaneous tissue, unspecified), (B6010, Acanthamebiasis, unspecified), (A311, Cutaneous mycobacterial infection), (B383, Cutaneous coccidioidomycosis), (L080, Pyoderma)] |
| 11 | 10::410::445 | hepatomegaly and fluid surge spacing | Diagnosis | 0.4627 | [(E8779, Other fluid overload), (E860, Dehydration), (E8770, Fluid overload, unspecified), (I313, Pericardial effusion (noninflammatory)), (J811, Chronic pulmonary edema)] |
| 12 | 10::456::478 | setting of hyponatremia | Diagnosis | 0.999 | [(E871, Hypo-osmolality and hyponatremia), (I953, Hypotension of hemodialysis), (I952, Hypotension due to drugs), (E870, Hyperosmolality and hypernatremia), (J9602, Acute respiratory failure with hypercapnia)] |
| 15 | 14::323::330 | ischemia | Diagnosis | 0.8969 | [(G450, Vertebro-basilar artery syndrome), (N280, Ischemia and infarction of kidney), (H3582, Retinal ischemia), (I6782, Cerebral ischemia), (I248, Other forms of acute ischemic heart disease)] |


In [0]:
rxnorm_analysis[rxnorm_analysis.rxnorm_conf>0.4].head(20)

| | coords | chunk | entity | rxnorm_conf | rxnorm_opts |
|---|---|---|---|---|---|
| 0 | 0::0::10 | pentamidine | DrugChem | 0.5925 | [(861601, Pentamidine Isethionate 300 MG Injection), (861597, Pentamidine Isethionate 50 MG/ML Inhalation Solution), (755627, Chloroquine 5 MG/ML Oral Solution), (855624, Dibromopropamidine isethionate 1 MG/ML Ophthalmic Solution), (1119497, chlo... |
| 1 | 0::37::47 | pentamidine | DrugChem | 0.5925 | [(861601, Pentamidine Isethionate 300 MG Injection), (861597, Pentamidine Isethionate 50 MG/ML Inhalation Solution), (755627, Chloroquine 5 MG/ML Oral Solution), (855624, Dibromopropamidine isethionate 1 MG/ML Ophthalmic Solution), (1119497, chlo... |
| 55 | 3::278::287 | creatinine | DrugChem | 0.9996 | [(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), (424168, Urea 30 MG/ML Topical Lotion), (251705, Urea 20 MG/ML Topical Lotion), (245052, Urea 200 MG/ML Oral Solution)] |
| 58 | 7::83::93 | cholesterol | DrugChem | 0.5609 | [(2104173, beta Sitosterol 35 MG Oral Tablet), (832876, phytosterol esters 500 MG Oral Capsule), (637208, phytosterol esters 650 MG Oral Capsule), (411217, Lecithin 228 MG Oral Capsule), (1737442, amphotericin B lipid complex 5 MG/ML Injection)] |
| 59 | 10::397::406 | creatinine | DrugChem | 0.9996 | [(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), (424168, Urea 30 MG/ML Topical Lotion), (251705, Urea 20 MG/ML Topical Lotion), (245052, Urea 200 MG/ML Oral Solution)] |
| 83 | 12::328::335 | mestinon | DrugChem | 0.4385 | [(2099309, moxetumomab pasudotox-tdfk 1 MG Injection), (886677, Clidinium bromide 2.5 MG Oral Capsule), (415693, Heparinoids 0.1 UNT/MG Topical Gel), (204558, Peptide Hydrolases 82 UNT/MG Topical Ointment), (1659998, ANTI-INHIBITOR COAGULANT COMP... |
| 84 | 12::372::379 | mestinon | DrugChem | 0.4385 | [(2099309, moxetumomab pasudotox-tdfk 1 MG Injection), (886677, Clidinium bromide 2.5 MG Oral Capsule), (415693, Heparinoids 0.1 UNT/MG Topical Gel), (204558, Peptide Hydrolases 82 UNT/MG Topical Ointment), (1659998, ANTI-INHIBITOR COAGULANT COMP... |
| 92 | 15::73::82 | creatinine | DrugChem | 0.9996 | [(849628, Creatinine 800 MG Oral Capsule), (252180, Urea 10 MG/ML Topical Lotion), (424168, Urea 30 MG/ML Topical Lotion), (251705, Urea 20 MG/ML Topical Lotion), (245052, Urea 200 MG/ML Oral Solution)] |


# Snomed Resolver

In [0]:

snomed_ner_converter = NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("greedy_chunk")\
  .setWhiteList(['PROBLEM','TEST'])

chunk_embeddings = ChunkEmbeddings()\
  .setInputCols('greedy_chunk', 'embeddings')\
  .setOutputCol('chunk_embeddings')

snomed_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")\
    .setInputCols("token","chunk_embeddings").setOutputCol("snomed_resolution")


pipeline_snomed = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    stopwords,
    word_embeddings,
    clinical_ner,
    snomed_ner_converter,
    chunk_embeddings,
    snomed_resolver
  ])

model_snomed = pipeline_snomed.fit(data)


In [0]:
snomed_output = model_snomed.transform(data)

snomed_output.write.mode("overwrite").save("snomed_temp")

snomed_output = spark.read.load("snomed_temp")

In [0]:
snomed_output.select(F.explode(F.arrays_zip("greedy_chunk.result","greedy_chunk.metadata","snomed_resolution.result","snomed_resolution.metadata")).alias("snomed_result")) \
    .select(F.expr("snomed_result['0']").alias("chunk"),
            F.expr("snomed_result['1'].entity").alias("entity"),
            F.expr("snomed_result['3'].all_k_resolutions").alias("target_text"),
            F.expr("snomed_result['2']").alias("code"),
            F.expr("snomed_result['3'].confidence").alias("confidence")).show(truncate = 100)

# Snomed with SentenceEntityResolver (BioBert) (after Spark NLP 2.7)

In [0]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")\
  .setWhiteList(['PROBLEM'])

c2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") 

bert_embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")\
  .setInputCols(["ner_chunk_doc"])\
  .setOutputCol("bert_embeddings")

snomed_resolution = SentenceEntityResolverModel.pretrained("biobertresolve_snomed_findings", "en", "clinical/models") \
  .setInputCols(["ner_chunk", "bert_embeddings"]) \
  .setOutputCol("snomed_code")

pipeline_snomed = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    c2doc,
    bert_embeddings,
    snomed_resolution
  ])


In [0]:

clinical_note = (
    'A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years '
    'prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior '
    'episode of HTG-induced pancreatitis three years prior to presentation, associated '
    'with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, '
    'presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. '
    'Two weeks prior to presentation, she was treated with a five-day course of amoxicillin '
    'for a respiratory tract infection. She was on metformin, glipizide, and dapagliflozin '
    'for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months '
    'at the time of presentation. Physical examination on presentation was significant for dry oral mucosa; '
    'significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent '
    'laboratory findings on admission were: serum glucose 111 mg/dl, bicarbonate 18 mmol/l, anion gap 20, '
    'creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, glycated hemoglobin (HbA1c) '
    '10%, and venous pH 7.27. Serum lipase was normal at 43 U/L. Serum acetone levels could not be assessed '
    'as blood samples kept hemolyzing due to significant lipemia. The patient was initially admitted for '
    'starvation ketosis, as she reported poor oral intake for three days prior to admission. However, '
    'serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL, the anion gap '
    'was still elevated at 21, serum bicarbonate was 16 mmol/L, triglyceride level peaked at 2050 mg/dL, and '
    'lipase was 52 U/L. The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - '
    'the original sample was centrifuged and the chylomicron layer removed prior to analysis due to '
    'interference from turbidity caused by lipemia again. The patient was treated with an insulin drip '
    'for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL, within '
    '24 hours. Her euDKA was thought to be precipitated by her respiratory tract infection in the setting '
    'of SGLT2 inhibitor use. The patient was seen by the endocrinology service and she was discharged on '
    '40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg '
    'two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely. She '
    'had close follow-up with endocrinology post discharge.'
)

data_ner = spark.createDataFrame([[clinical_note]]).toDF("text")

snomed_output = pipeline_snomed.fit(data_ner).transform(data_ner)


In [0]:
snomed_output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","snomed_code.result","snomed_code.metadata")).alias("snomed_result")) \
    .select(F.expr("snomed_result['0']").alias("chunk"),
            F.expr("snomed_result['1'].entity").alias("entity"),
            F.expr("snomed_result['3'].all_k_resolutions").alias("all_k_resolutions"),
            F.expr("snomed_result['2']").alias("snomed_code"),
            F.expr("snomed_result['3'].confidence").alias("confidence")).show(truncate = 100)
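
The `F.arrays_zip` plus `F.explode` pattern in the cell above can be hard to read at first. As a rough plain-Python analogy (not Spark code), `arrays_zip` pairs up the i-th element of each annotation array into a struct with positional fields `'0'`, `'1'`, ..., and `explode` turns each struct into its own row. All values below are toy placeholders chosen for illustration, not model output.

```python
# Plain-Python analogy of F.explode(F.arrays_zip(...)); illustrative only.
# Each list stands in for one array column of a single DataFrame row.
chunk_results = ["gestational diabetes mellitus", "obesity"]
chunk_metadata = [{"entity": "PROBLEM"}, {"entity": "PROBLEM"}]
code_results = ["<code-1>", "<code-2>"]                          # hypothetical placeholders
code_metadata = [{"confidence": "0.9"}, {"confidence": "0.8"}]  # hypothetical placeholders

# arrays_zip: combine the arrays element-wise into one array of structs,
# whose fields are addressed positionally as '0', '1', '2', '3'.
zipped = [
    {"0": c, "1": cm, "2": r, "3": rm}
    for c, cm, r, rm in zip(chunk_results, chunk_metadata, code_results, code_metadata)
]

# explode + select: one output row per struct, with readable column names.
rows = [
    {
        "chunk": z["0"],
        "entity": z["1"]["entity"],
        "snomed_code": z["2"],
        "confidence": z["3"]["confidence"],
    }
    for z in zipped
]

for row in rows:
    print(row)
```

Because the pairing is positional, the arrays passed to `arrays_zip` need to be aligned one-to-one, which is why the chunk and resolution columns are zipped together here.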

### with SNOMED INT

In [0]:

snomed_resolution_int = SentenceEntityResolverModel.pretrained("biobertresolve_snomed_findings_int", "en", "clinical/models") \
  .setInputCols(["ner_chunk", "bert_embeddings"]) \
  .setOutputCol("snomed_code_int")

pipeline_snomed_int = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    c2doc,
    bert_embeddings,
    snomed_resolution_int
  ])

snomed_output_int = pipeline_snomed_int.fit(data_ner).transform(data_ner)


In [0]:
snomed_output_int.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","snomed_code_int.result","snomed_code_int.metadata")).alias("snomed_result")) \
    .select(F.expr("snomed_result['0']").alias("chunk"),
            F.expr("snomed_result['1'].entity").alias("entity"),
            F.expr("snomed_result['3'].all_k_resolutions").alias("all_k_resolutions"),
            F.expr("snomed_result['2']").alias("snomed_code"),
            F.expr("snomed_result['3'].confidence").alias("confidence")).show(truncate = 100)

In [0]:

# The variable and output column are the INT ones, so load the matching "_int" model.
snomed_resolution_int = SentenceEntityResolverModel.pretrained("biobertresolve_snomed_findings_int", "en", "clinical/models") \
  .setInputCols(["ner_chunk", "bert_embeddings"]) \
  .setOutputCol("snomed_code_int")


# SentenceEntityResolver with BioBERT Sentence Embeddings (sBERT) fine-tuned on MedNLI (requires Spark NLP 2.6.4 and Spark NLP JSL 2.7.1)


**Warning**: **If you get an error related to Java port not found 55, the Colab memory probably could not handle the model and the Spark session died. In that case, try a larger machine, or restart the kernel at the top, then come back here and rerun.**

Pretrained resolver models built on these embeddings:

- sbiobertresolve_icd10cm
- sbiobertresolve_icd10pcs
- sbiobertresolve_snomed_findings (with clinical_findings concepts from CT version)
- sbiobertresolve_snomed_findings_int  (with clinical_findings concepts from INT version)
- sbiobertresolve_snomed_auxConcepts (with Morph Abnormality, Procedure, Substance, Physical Object, Body Structure concepts from CT version)
- sbiobertresolve_snomed_auxConcepts_int  (with Morph Abnormality, Procedure, Substance, Physical Object, Body Structure concepts from INT version)
- sbiobertresolve_rxnorm
- sbiobertresolve_icdo
- sbiobertresolve_cpt

In [0]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

clinical_ner = NerDLModel.pretrained("ner_clinical_large", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

c2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc") 

sbert_embedder = BertSentenceEmbeddings\
      .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")

icd10cm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm","en", "clinical/models") \
  .setInputCols(["ner_chunk", "sbert_embeddings"]) \
  .setOutputCol("icd10cm_code")\
  .setDistanceFunction("EUCLIDEAN")
  
sbert_pipeline_icd10cm = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        icd10cm_resolver])
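
`setDistanceFunction("EUCLIDEAN")` selects the metric used to compare a chunk's sentence embedding with the embeddings of candidate terms. The sketch below illustrates only that idea on toy vectors using the standard library. It is a conceptual illustration, not the library's implementation, and the vectors, term names and the `nearest_terms` helper are hypothetical.

```python
import math

# Toy reference index: candidate term -> embedding (hypothetical 3-d vectors;
# real sentence embeddings have hundreds of dimensions).
reference = {
    "term A": (1.0, 0.0, 0.0),
    "term B": (0.0, 1.0, 0.0),
    "term C": (0.9, 0.1, 0.0),
}

def nearest_terms(query, index, k=2):
    """Return the k (term, distance) pairs closest to `query` by Euclidean distance."""
    scored = [(term, math.dist(query, vec)) for term, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

query = (1.0, 0.05, 0.0)  # stands in for the embedding of one NER chunk
top = nearest_terms(query, reference)
for term, dist in top:
    print(f"{term}: {dist:.3f}")
```

The closest entry plays the role of the resolved code, and the remaining neighbours correspond to the alternatives reported in the `all_k_*` metadata fields.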

In [0]:

text = 'This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .'

data_ner = spark.createDataFrame([[text]]).toDF("text")

sbert_models = sbert_pipeline_icd10cm.fit(data_ner)

sbert_outputs = sbert_models.transform(data_ner)

from pyspark.sql import functions as F

icd10cm_sdf = sbert_outputs.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","icd10cm_code.result","icd10cm_code.metadata","ner_chunk.begin","ner_chunk.end")).alias("icd10cm_code")) \
    .select(F.expr("icd10cm_code['0']").alias("chunk"),
            F.expr("icd10cm_code['4']").alias("begin"),
            F.expr("icd10cm_code['5']").alias("end"),
            F.expr("icd10cm_code['1'].entity").alias("entity"),
            F.expr("icd10cm_code['2']").alias("code"),
            F.expr("icd10cm_code['3'].confidence").alias("confidence"),
            F.expr("icd10cm_code['3'].all_k_resolutions").alias("all_k_resolutions"),
            F.expr("icd10cm_code['3'].all_k_results").alias("all_k_codes"))

icd10cm_sdf.show(10)
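
The `all_k_resolutions` and `all_k_codes` columns hold every candidate for a chunk as a single delimited string. A small helper can split them into ranked (code, description) pairs. This sketch assumes the candidates are separated by `:::`, so it is worth checking the real output with `icd10cm_sdf.show(truncate=False)` before relying on it. The `split_candidates` helper and the sample strings are hypothetical.

```python
def split_candidates(all_k_codes, all_k_resolutions, sep=":::"):
    """Pair each candidate code with its description, keeping rank order.

    `sep` is an assumption about how the metadata strings are delimited.
    """
    codes = all_k_codes.split(sep)
    texts = all_k_resolutions.split(sep)
    if len(codes) != len(texts):
        raise ValueError("codes and resolutions have different lengths")
    return list(zip(codes, texts))

# Hypothetical placeholder strings, not real model output.
pairs = split_candidates("<code-1>:::<code-2>", "first description:::second description")
print(pairs)
```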


In [0]:
snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings", "en", "clinical/models") \
  .setInputCols(["ner_chunk", "sbert_embeddings"]) \
  .setOutputCol("snomed_ct_code")\
  .setDistanceFunction("EUCLIDEAN")
  
sbert_pipeline_snomed = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        snomed_resolver])

In [0]:

text = 'This is an 82 - year-old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .'

data_ner = spark.createDataFrame([[text]]).toDF("text")

sbert_models = sbert_pipeline_snomed.fit(data_ner)

sbert_outputs = sbert_models.transform(data_ner)

from pyspark.sql import functions as F

snomed_sdf = sbert_outputs.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","snomed_ct_code.result","snomed_ct_code.metadata","ner_chunk.begin","ner_chunk.end")).alias("snomed_ct_code")) \
    .select(F.expr("snomed_ct_code['0']").alias("chunk"),
            F.expr("snomed_ct_code['4']").alias("begin"),
            F.expr("snomed_ct_code['5']").alias("end"),
            F.expr("snomed_ct_code['1'].entity").alias("entity"),
            F.expr("snomed_ct_code['2']").alias("code"),
            F.expr("snomed_ct_code['3'].confidence").alias("confidence"),
            F.expr("snomed_ct_code['3'].all_k_resolutions").alias("all_k_resolutions"),
            F.expr("snomed_ct_code['3'].all_k_results").alias("all_k_codes"))

snomed_sdf.show(10)


In [0]:
# Swap in a different NER model (jsl_ner_wip_clinical); the remaining stages are reused unchanged.
jsl_ner = NerDLModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")


sbert_pipeline_snomed = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        jsl_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        snomed_resolver])

In [0]:

text = 'This is an 67 year-old male with a history of prior tobacco use, hypertension , chronic kidney deficiency, COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret\'s Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU .'

data_ner = spark.createDataFrame([[text]]).toDF("text")

sbert_models = sbert_pipeline_snomed.fit(data_ner)

sbert_outputs = sbert_models.transform(data_ner)

from pyspark.sql import functions as F

snomed_sdf = sbert_outputs.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","snomed_ct_code.result","snomed_ct_code.metadata","ner_chunk.begin","ner_chunk.end")).alias("snomed_ct_code")) \
    .select(F.expr("snomed_ct_code['0']").alias("chunk"),
            F.expr("snomed_ct_code['4']").alias("begin"),
            F.expr("snomed_ct_code['5']").alias("end"),
            F.expr("snomed_ct_code['1'].entity").alias("entity"),
            F.expr("snomed_ct_code['2']").alias("code"),
            F.expr("snomed_ct_code['3'].confidence").alias("confidence"),
            F.expr("snomed_ct_code['3'].all_k_resolutions").alias("all_k_resolutions"),
            F.expr("snomed_ct_code['3'].all_k_results").alias("all_k_codes"))

snomed_sdf.show(10, truncate=50)
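
Resolver output is often filtered before it is used downstream, for example by keeping only chunks whose top match clears a confidence threshold. Annotation metadata values are strings, so the score has to be cast first. The sketch below shows the idea on plain dictionaries. The rows, the 0.5 threshold and the `keep_confident` helper are hypothetical.

```python
def keep_confident(rows, threshold=0.5):
    """Keep rows whose string-valued `confidence` parses to >= threshold."""
    kept = []
    for row in rows:
        try:
            score = float(row["confidence"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or malformed confidence
        if score >= threshold:
            kept.append(row)
    return kept

# Hypothetical placeholder rows, not real model output.
rows = [
    {"chunk": "hypertension", "code": "<code-1>", "confidence": "0.91"},
    {"chunk": "gastritis", "code": "<code-2>", "confidence": "0.20"},
    {"chunk": "COPD", "code": "<code-3>", "confidence": None},
]
print(keep_confident(rows))
```

On the Spark side, a comparable filter could be written as `snomed_sdf.filter(F.col("confidence").cast("float") >= 0.5)`.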


End of Notebook # 3