
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_JSL.ipynb)

# **Detect Clinical Entities**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens. Otherwise, you can look at the example outputs at the bottom of the notebook.

# **Colab Setup**

In [1]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as notebook-level variables
# (globals() is used because writes to locals() are only reliable at module level)
globals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

Saving 3.5.3-key.json to 3.5.3-key.json
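
If the license file was uploaded through the file explorer instead, the `files.upload()` prompt can be skipped. The following is a minimal sketch of that path, assuming the file sits in the working directory under the name `license_keys.json`. The helper name `load_license_keys` is hypothetical and is not part of any Spark NLP API.

```python
import json
import os


def load_license_keys(path="license_keys.json"):
    """Read a license JSON file and export its entries as environment variables.

    `path` is an assumption: adjust it to wherever the file was uploaded.
    """
    with open(path) as f:
        keys = json.load(f)
    # os.environ only accepts strings, so every value is cast defensively
    os.environ.update({name: str(value) for name, value in keys.items()})
    return keys


# Only attempt the load when the file is actually present
if os.path.exists("license_keys.json"):
    license_keys = load_license_keys()
```

The returned dictionary can then be used in the same way as in the cell above to define `SECRET` and the version values as notebook variables.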


# **Install dependencies**

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

# **Import dependencies into Python and start the Spark session**

In [3]:
import json
import os

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession

import sparknlp
import sparknlp_jsl

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F
from sparknlp_display import NerVisualizer

import pandas as pd

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(secret = SECRET, params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.4.4
Spark NLP_JSL Version : 3.5.3


# **Models:**


```
ner_jsl
ner_jsl_slim
ner_jsl_enriched
ner_jsl_greedy

ner_jsl_biobert
ner_jsl_greedy_biobert

jsl_ner_wip_clinical
jsl_ner_wip_modifier_clinical

bert_token_classifier_ner_jsl
bert_token_classifier_ner_jsl_slim

```
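
The blank lines in the list above separate the models by the kind of pipeline they need. As a rough sketch, the grouping can be written down as a lookup table. The embedding names are taken from the pipelines defined later in this notebook, while the dictionary and the helper `models_using` are hypothetical names introduced only for illustration.

```python
# Embeddings each model is paired with in this notebook.
# None means the model is a token classifier and needs no separate embeddings stage.
EMBEDDINGS_BY_MODEL = {
    "ner_jsl": "embeddings_clinical",
    "ner_jsl_slim": "embeddings_clinical",
    "ner_jsl_enriched": "embeddings_clinical",
    "ner_jsl_greedy": "embeddings_clinical",
    "jsl_ner_wip_clinical": "embeddings_clinical",
    "jsl_ner_wip_modifier_clinical": "embeddings_clinical",
    "ner_jsl_biobert": "biobert_pubmed_base_cased",
    "ner_jsl_greedy_biobert": "biobert_pubmed_base_cased",
    "bert_token_classifier_ner_jsl": None,
    "bert_token_classifier_ner_jsl_slim": None,
}


def models_using(embeddings_name):
    """Return the model names paired with the given embeddings (or None)."""
    return [m for m, e in EMBEDDINGS_BY_MODEL.items() if e == embeddings_name]


print(models_using("biobert_pubmed_base_cased"))
# ['ner_jsl_biobert', 'ner_jsl_greedy_biobert']
```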



# **🔎Sample Text**

In [4]:
sample_text = """The patient is a 30-year-old female with a long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with her previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, Xenaderm to wounds topically b.i.d., Lantus 40 units subcutaneously at bedtime, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. daily, Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., Percocet 5/325 mg 2 tablets q.4 h. p.r.n., magnesium citrate 1 bottle p.o. p.r.n., sliding scale coverage insulin, Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d."""




# ***🔎For models:***

- ***ner_jsl***
- ***ner_jsl_slim***
- ***ner_jsl_enriched***
- ***ner_jsl_greedy***
- ***jsl_ner_wip_clinical***
- ***jsl_ner_wip_modifier_clinical***

### **Define Spark NLP pipeline**

In [7]:
jsl_model_list = ["ner_jsl", 
                  "ner_jsl_slim", 
                  "ner_jsl_enriched", 
                  "ner_jsl_greedy", 
                  "jsl_ner_wip_clinical", 
                  "jsl_ner_wip_modifier_clinical"]

In [8]:
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
        .setInputCols(["document"]) \
        .setOutputCol("sentence") 

tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

jsl_ner_converter = NerConverter() \
        .setInputCols(["sentence", "token", "jsl_ner"]) \
        .setOutputCol("ner_chunk")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
                .setInputCols(["sentence", "token"])\
                .setOutputCol("embeddings")
  
for model_name in jsl_model_list:

  jsl_ner = MedicalNerModel.pretrained(model_name, "en", "clinical/models") \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("jsl_ner")


  jsl_ner_pipeline = Pipeline(stages=[documentAssembler, 
                                      sentenceDetector,
                                      tokenizer,
                                      embeddings,
                                      jsl_ner,
                                      jsl_ner_converter])


  jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))
  light_model = LightPipeline(jsl_ner_model)
  light_result = light_model.fullAnnotate(sample_text)

  print("\n\n\n")
  print(f"***************  The visualization results for {model_name} ***************")
  print("\n\n\n")
  visualiser = NerVisualizer()
  visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')
  print("\n\n\n")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]

***************  The visualization results for ner_jsl ***************

ner_jsl_slim download started this may take some time.
[OK!]

***************  The visualization results for ner_jsl_slim ***************

ner_jsl_enriched download started this may take some time.
[OK!]

***************  The visualization results for ner_jsl_enriched ***************

ner_jsl_greedy download started this may take some time.
[OK!]

***************  The visualization results for ner_jsl_greedy ***************

jsl_ner_wip_clinical download started this may take some time.
[OK!]

***************  The visualization results for jsl_ner_wip_clinical ***************

jsl_ner_wip_modifier_clinical download started this may take some time.
[OK!]

***************  The visualization results for jsl_ner_wip_modifier_clinical ***************
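
Besides the visualization, the chunks in `light_result` can be inspected as plain data. The sketch below assumes that each item in `light_result[0]['ner_chunk']` is a Spark NLP annotation exposing `result`, `begin`, `end`, and a `metadata` dictionary with an `entity` key. The helper name `chunks_to_rows` is hypothetical.

```python
def chunks_to_rows(annotations):
    """Turn NER chunk annotations into plain dictionaries, one per chunk."""
    rows = []
    for ann in annotations:
        rows.append({
            "chunk": ann.result,   # the text of the detected chunk
            "begin": ann.begin,    # start character offset
            "end": ann.end,        # end character offset
            # Assumption: the predicted label is stored under the "entity" metadata key
            "entity": ann.metadata.get("entity"),
        })
    return rows


# Usage in the notebook (assumes `light_result` from the cell above):
# rows = chunks_to_rows(light_result[0]["ner_chunk"])
# pd.DataFrame(rows)
```

Returning plain dictionaries keeps the helper independent of pandas; the rows can be wrapped in a DataFrame when a tabular view is wanted.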













# ***🔎For models:***

- ***ner_jsl_biobert***
- ***ner_jsl_greedy_biobert***

### **Define Spark NLP pipeline**

In [9]:
biobert_jsl_model_list = ["ner_jsl_biobert", 
                          "ner_jsl_greedy_biobert"]

In [10]:
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
        .setInputCols(["document"]) \
        .setOutputCol("sentence") 

tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

jsl_ner_converter = NerConverter() \
        .setInputCols(["sentence", "token", "jsl_ner"]) \
        .setOutputCol("ner_chunk")

for model_name in biobert_jsl_model_list:

  jsl_ner = MedicalNerModel.pretrained(model_name, "en", "clinical/models") \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("jsl_ner")


  jsl_ner_pipeline = Pipeline(stages=[documentAssembler, 
                                      sentenceDetector,
                                      tokenizer,
                                      embeddings,
                                      jsl_ner,
                                      jsl_ner_converter])


  biobert_jsl_ner_model = jsl_ner_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))
  light_model = LightPipeline(biobert_jsl_ner_model)
  light_result = light_model.fullAnnotate(sample_text)

  print("\n\n\n")
  print(f"***************  The visualization results for {model_name} ***************")
  print("\n\n\n")
  visualiser = NerVisualizer()
  visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')
  print("\n\n\n")


sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
ner_jsl_biobert download started this may take some time.
[OK!]

***************  The visualization results for ner_jsl_biobert ***************

ner_jsl_greedy_biobert download started this may take some time.
[OK!]

***************  The visualization results for ner_jsl_greedy_biobert ***************
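
To compare models numerically rather than visually, one option is to count how many chunks each model assigns to each label. This is only a sketch: it assumes the chunk annotations carry the label in `metadata['entity']`, and the names `entity_label_counts` and `compare_models` are hypothetical.

```python
from collections import Counter


def entity_label_counts(annotations):
    """Count chunk annotations per entity label."""
    return Counter(ann.metadata.get("entity", "UNKNOWN") for ann in annotations)


def compare_models(results_by_model):
    """Map each model name to its label counts.

    `results_by_model` is expected to look like {model_name: [chunk annotations]}.
    """
    return {name: entity_label_counts(anns) for name, anns in results_by_model.items()}


# Usage in the notebook (assumption: filled inside the loop above):
# results_by_model[model_name] = light_result[0]["ner_chunk"]
# compare_models(results_by_model)
```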














# ***🔎For models:***

- ***bert_token_classifier_ner_jsl***
- ***bert_token_classifier_ner_jsl_slim***



### **Define Spark NLP pipeline**

In [11]:
bert_jsl_ner_model_list = ["bert_token_classifier_ner_jsl", 
                           "bert_token_classifier_ner_jsl_slim"]

In [12]:
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence") 

tokenizer = Tokenizer()\
        .setInputCols("sentence")\
        .setOutputCol("token")

ner_converter = NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

for model_name in bert_jsl_ner_model_list:
  tokenClassifier = MedicalBertForTokenClassifier.pretrained(model_name, "en", "clinical/models")\
        .setInputCols(["token", "sentence"])\
        .setOutputCol("ner")\
        .setCaseSensitive(True)

  bert_jsl_ner_pipeline =  Pipeline(stages=[
                                        documentAssembler, 
                                        sentenceDetector, 
                                        tokenizer, 
                                        tokenClassifier, 
                                        ner_converter])


  bert_jsl_ner_model = bert_jsl_ner_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))
  light_model = LightPipeline(bert_jsl_ner_model)
  light_result = light_model.fullAnnotate(sample_text)

  print("\n\n\n")
  print(f"***************  The visualization results for {model_name} ***************")
  print("\n\n\n")
  visualiser = NerVisualizer()
  visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')
  print("\n\n\n")

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
bert_token_classifier_ner_jsl download started this may take some time.
[OK!]

***************  The visualization results for bert_token_classifier_ner_jsl ***************

bert_token_classifier_ner_jsl_slim download started this may take some time.
[OK!]

***************  The visualization results for bert_token_classifier_ner_jsl_slim ***************
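
All three loops in this notebook repeat the same steps: build a `Pipeline`, fit it on an empty DataFrame, wrap the result in a `LightPipeline`, and call `fullAnnotate`. A possible way to factor that out is sketched below. The helper name `annotate_with_stages` is hypothetical, and the sketch assumes an active Spark session and already configured stages, as in the cells above.

```python
def annotate_with_stages(spark, stages, text):
    """Fit `stages` on an empty frame and annotate `text` with a LightPipeline."""
    # Imported lazily so the helper can be defined before Spark NLP is installed
    from pyspark.ml import Pipeline
    from sparknlp.base import LightPipeline

    # The stages are pretrained, so fitting on an empty frame only assembles the pipeline
    empty_df = spark.createDataFrame([[""]]).toDF("text")
    model = Pipeline(stages=stages).fit(empty_df)
    # fullAnnotate returns one result per input text; a single string yields one item
    return LightPipeline(model).fullAnnotate(text)[0]


# Usage in the notebook (assumes the stage objects defined in the cell above):
# result = annotate_with_stages(
#     spark,
#     [documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter],
#     sample_text,
# )
# NerVisualizer().display(result, label_col="ner_chunk", document_col="document")
```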










