
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_POSOLOGY.ipynb)

# **Detect drugs and prescriptions**

To run this yourself, you will need to upload your license keys to the notebook. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload your license file as `spark_jsl.json`, the name the cell below checks for. Otherwise, you can look at the example outputs at the bottom of the notebook.

# **Colab Setup**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
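
The install cells that follow read `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET` from the loaded license file. As a minimal sketch, a check like the one below could flag a missing key before the installs fail. The helper name `missing_license_keys` is hypothetical and not part of Spark NLP, and the example dictionary is made up.

```python
# Keys this notebook reads from the license file
# (used by the pip commands and by sparknlp_jsl.start)
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")

def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `keys` (hypothetical helper)."""
    return [name for name in required if not keys.get(name)]

# Example with a made-up, incomplete license dictionary
example_keys = {"PUBLIC_VERSION": "4.2.8", "SECRET": "xxxx"}
print(missing_license_keys(example_keys))
# ['JSL_VERSION']
```

In the notebook itself, the same check could be applied to `license_keys` right after it is loaded.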

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


📌**Models:**


```
ner_posology
ner_posology_small
ner_posology_large
ner_posology_greedy
ner_posology_experimental

ner_drugs_large
ner_drugs_greedy

ner_jsl
ner_jsl_enriched

ner_clinical
ner_clinical_large

```



# **Define Spark NLP pipeline**

In [4]:
model_list = ['ner_posology',
              'ner_posology_small', 
              'ner_posology_large', 
              'ner_posology_greedy',
              "ner_posology_experimental",
              'ner_drugs_large', 
              'ner_drugs_greedy',
              'ner_jsl',
              'ner_jsl_enriched', 
              'ner_clinical',
              'ner_clinical_large']

In [5]:
#basic_stages👇🏻

documentAssembler = DocumentAssembler() \
    .setInputCol('text')\
    .setOutputCol('document')

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

tokenizer = Tokenizer()\
    .setInputCols(['sentence']) \
    .setOutputCol('token')

word_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models') \
    .setInputCols(['sentence', 'token']) \
    .setOutputCol('embeddings')
    
ner_converter = NerConverter() \
    .setInputCols(['sentence', 'token', 'ner']) \
    .setOutputCol('ner_chunk')
    
#select_ner_models👇🏻

def pipeline(model_name): 
    clinical_ner = MedicalNerModel.pretrained(model_name, 'en', 'clinical/models') \
        .setInputCols(['sentence', 'token', 'embeddings']) \
        .setOutputCol('ner')
    
    # Broader models emit many labels; keep only the drug/posology-related ones
    white_lists = {
        'ner_clinical': ['TREATMENT'],
        'ner_clinical_large': ['TREATMENT'],
        'ner_jsl': ["Drug_BrandName", "Drug_Ingredient", "Dosage", "Frequency", "Route", "Strength"],
        'ner_jsl_enriched': ["Drug_BrandName", "Duration", "Frequency", "Treatment", "Dosage", "Route", "Strength", "Drug_Ingredient", "Form"],
    }

    ner_converter = NerConverter() \
        .setInputCols(['sentence', 'token', 'ner']) \
        .setOutputCol('ner_chunk')

    # Models without an entry keep all of their labels
    if model_name in white_lists:
        ner_converter.setWhiteList(white_lists[model_name])
            

    nlpPipeline = Pipeline(
        stages=[
            documentAssembler, 
            sentenceDetector,
            tokenizer,
            word_embeddings,
            clinical_ner,
            ner_converter
            ])

    pipelineModel = nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text"))
    return pipelineModel

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
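
For the broader models, the `pipeline` function above passes a whitelist to `NerConverter` so that only drug-related labels survive. As a rough, pure-Python illustration of that filtering idea (not the library's implementation), the sketch below keeps only the chunks whose label is in an allowed list. The function `filter_chunks` and the sample chunks are made up for illustration.

```python
def filter_chunks(chunks, white_list):
    """Keep (text, label) pairs whose label appears in white_list."""
    allowed = set(white_list)
    return [(text, label) for text, label in chunks if label in allowed]

# Made-up chunks resembling what a general clinical NER model might emit
chunks = [
    ("diabetes", "PROBLEM"),
    ("Bactrim", "TREATMENT"),
    ("UTI", "PROBLEM"),
    ("Fragmin", "TREATMENT"),
]

print(filter_chunks(chunks, ["TREATMENT"]))
# [('Bactrim', 'TREATMENT'), ('Fragmin', 'TREATMENT')]
```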


# **Sample Text**

In [6]:
sample_texts = """The patient is a 30-year-old female with a long history of diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. She did receive a course of Bactrim for 14 days for UTI. Evidently, at some point in time, the patient was noted to develop a pressure-type wound on the sole of her left foot and left great toe. She was also noted to have a large sacral wound; this is in a similar location with previous laminectomy, and this continues to receive daily care. The patient was transferred secondary to inability to participate in full physical and occupational therapy and continue medical management of her diabetes, the sacral decubitus, left foot pressure wound, and associated complications of diabetes. She is given Fragmin 5000 units subcutaneously daily, OxyContin 30 mg p.o. q.12 h., folic acid 1 mg daily, levothyroxine 0.1 mg p.o. , Prevacid 30 mg daily, Avandia 4 mg daily, Norvasc 10 mg daily, Lexapro 20 mg daily, aspirin 81 mg daily, Senna 2 tablets p.o. q.a.m., Neurontin 400 mg p.o. t.i.d., magnesium citrate 1 bottle p.o , Wellbutrin 100 mg p.o. daily, and Bactrim DS b.i.d."""


In [7]:
df = spark.createDataFrame([[sample_texts]]).toDF("text")
df.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|The patient is a 30-year-old female with a long history of diabetes, type 2; coronary artery dise...|
+----------------------------------------------------------------------------------------------------+
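
`model_list` is defined above, but the sections that follow run each model by hand. A loop along these lines could collect every model's output in one pass. This is only a sketch: `run_models` is a hypothetical helper, and `build_pipeline` stands in for the notebook's `pipeline` function, passed as an argument so the snippet stays self-contained.

```python
def run_models(model_names, build_pipeline, data):
    """Build the pipeline for each model and transform `data`.

    Returns a dict mapping model name to its transformed result.
    """
    results = {}
    for name in model_names:
        results[name] = build_pipeline(name).transform(data)
    return results

# In this notebook the call would look like:
#   results = run_models(model_list, pipeline, df)
#   results["ner_posology"].select("ner_chunk.result").show(truncate=False)
```

Each call to `pipeline` downloads and loads a model, so running the full list takes noticeably longer than a single model.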



# **Models**

## 🔎`ner_posology`

In [8]:
result = pipeline("ner_posology").transform(df)

ner_posology download started this may take some time.
[OK!]


In [9]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
    


+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|Bactrim       |DRUG     |
|for 14 days   |DURATION |
|Fragmin       |DRUG     |
|5000 units    |DOSAGE   |
|subcutaneously|ROUTE    |
|daily         |FREQUENCY|
|OxyContin     |DRUG     |
|30 mg         |STRENGTH |
|p.o           |ROUTE    |
|q.12 h        |FREQUENCY|
|folic acid    |DRUG     |
|1 mg          |STRENGTH |
|daily         |FREQUENCY|
|levothyroxine |DRUG     |
|0.1 mg        |STRENGTH |
+--------------+---------+
only showing top 15 rows
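
The `select` above pairs each chunk's text with its metadata via `arrays_zip`, and `explode` then turns every pair into its own row. A plain-Python analogue of that reshaping might look like the sketch below. The values are copied from the first rows of the table above, the metadata is simplified to the single `entity` key, and the variable names are illustrative only.

```python
# One document's chunk texts and their (simplified) metadata, as parallel lists
chunk_results = ["Bactrim", "for 14 days", "Fragmin"]
chunk_metadata = [
    {"entity": "DRUG"},
    {"entity": "DURATION"},
    {"entity": "DRUG"},
]

# zip pairs the i-th text with the i-th metadata map (the arrays_zip step);
# the list comprehension emits one row per pair (the explode step)
rows = [(text, meta["entity"]) for text, meta in zip(chunk_results, chunk_metadata)]

for chunk, ner_label in rows:
    print(f"{chunk:<12} {ner_label}")
```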



In [10]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_posology_small`

In [11]:
result = pipeline("ner_posology_small").transform(df)

ner_posology_small download started this may take some time.
[OK!]


In [12]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
    


+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|Bactrim       |DRUG     |
|for 14 days   |DURATION |
|Fragmin       |DRUG     |
|5000 units    |DOSAGE   |
|subcutaneously|ROUTE    |
|daily         |FREQUENCY|
|OxyContin     |DRUG     |
|30 mg         |STRENGTH |
|p.o           |ROUTE    |
|q.12 h.,      |FREQUENCY|
|folic acid    |DRUG     |
|1 mg          |STRENGTH |
|daily         |FREQUENCY|
|levothyroxine |DRUG     |
|0.1 mg        |STRENGTH |
+--------------+---------+
only showing top 15 rows



In [13]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_posology_large`

In [14]:
result = pipeline("ner_posology_large").transform(df)

ner_posology_large download started this may take some time.
[OK!]


In [15]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
    


+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|Bactrim       |DRUG     |
|for 14 days   |DURATION |
|Fragmin       |DRUG     |
|5000 units    |DOSAGE   |
|subcutaneously|ROUTE    |
|daily         |FREQUENCY|
|OxyContin     |DRUG     |
|30 mg         |STRENGTH |
|p.o.          |ROUTE    |
|q.12 h        |FREQUENCY|
|folic acid    |DRUG     |
|1 mg          |STRENGTH |
|daily         |FREQUENCY|
|levothyroxine |DRUG     |
|0.1 mg        |STRENGTH |
+--------------+---------+
only showing top 15 rows



In [16]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_posology_greedy`

In [17]:
result = pipeline("ner_posology_greedy").transform(df)

ner_posology_greedy download started this may take some time.
[OK!]


In [18]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
  

+---------------------------------+---------+
|chunk                            |ner_label|
+---------------------------------+---------+
|Bactrim                          |DRUG     |
|for 14 days                      |DURATION |
|Fragmin 5000 units subcutaneously|DRUG     |
|daily                            |FREQUENCY|
|OxyContin 30 mg p.o              |DRUG     |
|q.12 h                           |FREQUENCY|
|folic acid 1 mg                  |DRUG     |
|daily                            |FREQUENCY|
|levothyroxine 0.1 mg p.o         |DRUG     |
|Prevacid 30 mg                   |DRUG     |
|daily                            |FREQUENCY|
|Avandia 4 mg                     |DRUG     |
|daily                            |FREQUENCY|
|Norvasc 10 mg                    |DRUG     |
|daily                            |FREQUENCY|
+---------------------------------+---------+
only showing top 15 rows
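
Compared with `ner_posology`, the greedy model folds strength and route into the `DRUG` chunk itself. When working from the fine-grained output instead, one possible way to get a per-drug view is to start a new group at every `DRUG` chunk and attach the chunks that follow it. This is a simple heuristic sketch, not something the library does; `group_by_drug` is a hypothetical helper, and the rows are copied from the `ner_posology` table earlier in the notebook.

```python
def group_by_drug(rows):
    """Start a new group at each DRUG chunk; attach following chunks to it."""
    groups = []
    for chunk, label in rows:
        if label == "DRUG":
            groups.append({"DRUG": chunk})
        elif groups:
            # Heuristic: an attribute belongs to the most recent drug
            groups[-1][label] = chunk
    return groups

rows = [
    ("Fragmin", "DRUG"), ("5000 units", "DOSAGE"),
    ("subcutaneously", "ROUTE"), ("daily", "FREQUENCY"),
    ("OxyContin", "DRUG"), ("30 mg", "STRENGTH"),
    ("p.o", "ROUTE"), ("q.12 h", "FREQUENCY"),
]

for group in group_by_drug(rows):
    print(group)
```

The heuristic assumes attributes come after the drug they describe, which holds for this sample text but not for every clinical note.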



In [19]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_posology_experimental`

In [20]:
result = pipeline("ner_posology_experimental").transform(df)

ner_posology_experimental download started this may take some time.
[OK!]


In [21]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
  

+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|Bactrim       |Drug     |
|for 14 days   |Duration |
|Fragmin       |Drug     |
|5000 units    |Strength |
|subcutaneously|Route    |
|daily         |Frequency|
|OxyContin     |Drug     |
|30 mg         |Strength |
|p.o           |Route    |
|q.12 h        |Frequency|
|folic acid    |Drug     |
|1 mg          |Strength |
|daily         |Frequency|
|levothyroxine |Drug     |
|0.1 mg        |Strength |
+--------------+---------+
only showing top 15 rows
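
`ner_posology_experimental` writes its labels in title case (`Drug`, `Strength`), while the earlier posology models use upper case, and it tags `5000 units` as `Strength` where `ner_posology` gave `DOSAGE`. To compare two models' outputs, one option is to normalize label case first and then list the rows that still differ. The sketch below does this on a few rows copied from the two tables; `normalize` is a hypothetical helper, and the pairwise comparison assumes both models produced the same chunks in the same order.

```python
def normalize(rows):
    """Upper-case the labels so models with different casing can be compared."""
    return [(chunk, label.upper()) for chunk, label in rows]

posology = [("Fragmin", "DRUG"), ("5000 units", "DOSAGE"), ("subcutaneously", "ROUTE")]
experimental = [("Fragmin", "Drug"), ("5000 units", "Strength"), ("subcutaneously", "Route")]

# Keep only the aligned pairs that disagree after normalization
differences = [
    (a, b)
    for a, b in zip(normalize(posology), normalize(experimental))
    if a != b
]
print(differences)
# [(('5000 units', 'DOSAGE'), ('5000 units', 'STRENGTH'))]
```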



In [22]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_drugs_large`

In [23]:
result = pipeline("ner_drugs_large").transform(df)

ner_drugs_large download started this may take some time.
[OK!]


In [24]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
  

+---------------------------------+---------+
|chunk                            |ner_label|
+---------------------------------+---------+
|Bactrim                          |DRUG     |
|Fragmin 5000 units subcutaneously|DRUG     |
|OxyContin 30 mg p.o              |DRUG     |
|folic acid 1 mg                  |DRUG     |
|levothyroxine 0.1 mg p.o         |DRUG     |
|Prevacid 30 mg                   |DRUG     |
|Avandia 4 mg                     |DRUG     |
|Norvasc 10 mg                    |DRUG     |
|Lexapro 20 mg                    |DRUG     |
|aspirin 81 mg                    |DRUG     |
|Senna 2 tablets p.o              |DRUG     |
|Neurontin 400 mg p.o             |DRUG     |
|magnesium citrate 1 bottle p.o   |DRUG     |
|Wellbutrin 100 mg p.o            |DRUG     |
|Bactrim DS                       |DRUG     |
+---------------------------------+---------+



In [25]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_drugs_greedy`

In [26]:
result = pipeline("ner_drugs_greedy").transform(df)

ner_drugs_greedy download started this may take some time.
[OK!]


In [27]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
  

+---------------------------------+---------+
|chunk                            |ner_label|
+---------------------------------+---------+
|Bactrim                          |DRUG     |
|Fragmin 5000 units subcutaneously|DRUG     |
|OxyContin 30 mg p.o              |DRUG     |
|folic acid 1 mg                  |DRUG     |
|levothyroxine 0.1 mg p.o         |DRUG     |
|Prevacid 30 mg                   |DRUG     |
|Avandia 4 mg                     |DRUG     |
|Norvasc 10 mg                    |DRUG     |
|Lexapro 20 mg                    |DRUG     |
|aspirin 81 mg                    |DRUG     |
|Senna 2 tablets p.o              |DRUG     |
|Neurontin 400 mg p.o             |DRUG     |
|magnesium citrate 1 bottle p.o   |DRUG     |
|Wellbutrin 100 mg p.o            |DRUG     |
|Bactrim DS                       |DRUG     |
+---------------------------------+---------+



In [28]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_jsl`

In [29]:
result = pipeline("ner_jsl").transform(df)

ner_jsl download started this may take some time.
[OK!]


In [30]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
  

+--------------+---------------+
|chunk         |ner_label      |
+--------------+---------------+
|Bactrim       |Drug_BrandName |
|daily         |Frequency      |
|Fragmin       |Drug_BrandName |
|5000 units    |Dosage         |
|subcutaneously|Route          |
|daily         |Frequency      |
|OxyContin     |Drug_BrandName |
|30 mg         |Strength       |
|p.o           |Route          |
|q.12 h        |Frequency      |
|folic acid    |Drug_Ingredient|
|1 mg          |Strength       |
|daily         |Frequency      |
|levothyroxine |Drug_Ingredient|
|0.1 mg        |Strength       |
+--------------+---------------+
only showing top 15 rows



In [31]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_jsl_enriched`

In [32]:
result = pipeline("ner_jsl_enriched").transform(df)

ner_jsl_enriched download started this may take some time.
[OK!]


In [33]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
  

+--------------------+---------------+
|chunk               |ner_label      |
+--------------------+---------------+
|Bactrim             |Drug_BrandName |
|for 14 days         |Duration       |
|daily               |Frequency      |
|occupational therapy|Treatment      |
|Fragmin             |Drug_BrandName |
|5000 units          |Dosage         |
|subcutaneously      |Route          |
|daily               |Frequency      |
|OxyContin           |Drug_BrandName |
|30 mg               |Strength       |
|p.o                 |Route          |
|q.12 h              |Frequency      |
|folic acid          |Drug_Ingredient|
|1 mg                |Strength       |
|daily               |Frequency      |
+--------------------+---------------+
only showing top 15 rows



In [34]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_clinical`

In [35]:
result = pipeline("ner_clinical").transform(df)

ner_clinical download started this may take some time.
[OK!]


In [36]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
  

+--------------------------------------+---------+
|chunk                                 |ner_label|
+--------------------------------------+---------+
|Bactrim                               |TREATMENT|
|previous laminectomy                  |TREATMENT|
|full physical and occupational therapy|TREATMENT|
|medical management                    |TREATMENT|
|Fragmin                               |TREATMENT|
|OxyContin                             |TREATMENT|
|folic acid                            |TREATMENT|
|levothyroxine                         |TREATMENT|
|Prevacid                              |TREATMENT|
|Avandia                               |TREATMENT|
|Norvasc                               |TREATMENT|
|Lexapro                               |TREATMENT|
|aspirin                               |TREATMENT|
|Senna                                 |TREATMENT|
|Neurontin                             |TREATMENT|
+--------------------------------------+---------+
only showing top 15 rows



In [37]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')

## 🔎`ner_clinical_large`

In [38]:
result = pipeline("ner_clinical_large").transform(df)

ner_clinical_large download started this may take some time.
[OK!]


In [39]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15,truncate=False)
  

+--------------------------------------+---------+
|chunk                                 |ner_label|
+--------------------------------------+---------+
|Bactrim                               |TREATMENT|
|previous laminectomy                  |TREATMENT|
|full physical and occupational therapy|TREATMENT|
|medical management                    |TREATMENT|
|Fragmin                               |TREATMENT|
|OxyContin                             |TREATMENT|
|folic acid                            |TREATMENT|
|levothyroxine                         |TREATMENT|
|Prevacid                              |TREATMENT|
|Avandia                               |TREATMENT|
|Norvasc                               |TREATMENT|
|Lexapro                               |TREATMENT|
|aspirin                               |TREATMENT|
|Senna                                 |TREATMENT|
|Neurontin                             |TREATMENT|
+--------------------------------------+---------+
only showing top 15 rows



In [40]:
visualiser = NerVisualizer()

visualiser.display(result = result.collect()[0] ,label_col = 'ner_chunk', document_col = 'document')