![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/34.0.Clinical_Medication_Use_Case.ipynb)

# **CLINICAL MEDICATION**

**PROBLEM:** Your company has a clinical dataset, and you need to extract information to address the following questions for developing a new solution:

- What are the most commonly used medications in this dataset?
- What are the dosage, frequency, strength, and route details for these medications?
- Which medications are currently in use, and which ones were used previously?
- What actions do these medications have?
- For which treatment are these medications used?
- What are the RxNorm and NDC codes for these medications?
- Can you map the RxNorm codes to NDC, UMLS and SNOMED codes?
- Is there a way to obtain the adverse events associated with these medications?
- To get all this information at once, can you create a model for a real-time application that we will serve on our web page?

Solutions are here! At the end of this notebook, you will be able to answer all these questions and more.


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

## Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [4]:
from johnsnowlabs import nlp, medical
import functools
import pandas as pd
import numpy as np
from scipy import spatial
from pyspark.sql.types import StringType
import pyspark.sql.functions as F

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/5.4.0.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.4.0, 💊Spark-Healthcare==5.4.0, running on ⚡ PySpark==3.4.0


In [5]:
spark


# Medication NER Models

The NER models cover different entity groups and levels of granularity. If you want to extract as much information as possible from clinical texts, `ner_jsl` is a good starting point, as it can detect more than 80 different clinical entities. However, you might prefer other models depending on your specific requirements. The "greedy" models chunk drugs, dosage, form, strength, and route together when they appear side by side, resulting in a bigger chunk. Below is a list of NER models suitable for medication-related tasks.


**NER Models List**

|index|model|related entities|
|:-----:|:-----:|:-----:|
| 1 | [ner_posology](https://nlp.johnsnowlabs.com/2020/04/15/ner_posology_en.html) | 'DRUG', 'DOSAGE', 'DURATION', 'FORM', 'FREQUENCY', 'ROUTE', 'STRENGTH' |
| 2 | [ner_posology_greedy](https://nlp.johnsnowlabs.com/2021/03/31/ner_posology_greedy_en.html) | 'DRUG', 'DOSAGE', 'DURATION', 'FORM', 'FREQUENCY', 'ROUTE', 'STRENGTH' |
| 3 | [ner_jsl](https://nlp.johnsnowlabs.com/2022/10/19/ner_jsl_en.html) | 'DRUG_INGREDIENT', 'DRUG_BRANDNAME', 'FREQUENCY', 'STRENGTH', 'ROUTE', 'FORM' |
| 4 | [ner_jsl_greedy](https://nlp.johnsnowlabs.com/2021/06/24/ner_jsl_greedy_en.html) | 'DRUG_INGREDIENT', 'DRUG_BRANDNAME', 'FREQUENCY', 'STRENGTH', 'ROUTE', 'FORM' |
| 5 | [drug_matcher](https://nlp.johnsnowlabs.com/2024/03/19/drug_matcher_en.html) | 'DRUG' |



## Granular Medication NER Models
These models can detect 'DRUG', 'FREQUENCY', 'STRENGTH', 'ROUTE', 'DURATION', 'DOSAGE' and 'FORM' entities in a clinical text.

**Initial Stages**

In [6]:
document_assambler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


**NER Models**

In [17]:
ner_jsl = medical.NerModel.pretrained("ner_jsl","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_jsl")\
    .setLabelCasing("upper")

jsl_converter_internal = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_jsl"])\
    .setOutputCol("jsl_drug_chunk")\
    .setWhiteList(['DRUG_INGREDIENT', 'DRUG_BRANDNAME', 'FREQUENCY', 'STRENGTH', 'ROUTE', 'DOSAGE'])\
    .setReplaceLabels({"DRUG_INGREDIENT" : "DRUG", "DRUG_BRANDNAME" : "DRUG"})

ner_posology = medical.NerModel.pretrained("ner_posology","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_posology")

ner_converter_internal = medical.NerConverterInternal()\
    .setInputCols(["sentence","token", "ner_posology"])\
    .setOutputCol("posology_chunk")

text_matcher = medical.TextMatcherModel.pretrained("drug_matcher","en","clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")

# merge chunks to get a single result
chunk_merge = medical.ChunkMergeApproach()\
      .setInputCols("posology_chunk", "jsl_drug_chunk", "matched_text")\
      .setOutputCol("medication_chunk")\
      .setOrderingFeatures(["ChunkPrecedence"]) \
      .setChunkPrecedence('ner_source,entity')\
      .setChunkPrecedenceValuePrioritization(["posology_chunk,DRUG", "jsl_drug_chunk,DRUG","matched_text,DRUG"])\
      .setCaseSensitive(False)

ner_jsl download started this may take some time.
[OK!]
ner_posology download started this may take some time.
[OK!]
drug_matcher download started this may take some time.
[OK!]
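
To build intuition for what the merge stage does with overlapping chunks, the sketch below resolves overlaps with a simple priority list in plain Python. It is only an illustration under simplifying assumptions: the `resolve_overlaps` helper and its tuple layout are hypothetical, the sample chunks are hand-copied, and the real `ChunkMergeApproach` is not implemented this way and supports more ordering options.

```python
# Hypothetical, simplified illustration of precedence-based chunk merging.
# Each chunk is (begin, end, text, entity, source); `end` is inclusive.
PRIORITY = ["posology_chunk", "jsl_drug_chunk", "matched_text"]

def overlaps(a, b):
    """True when the character spans of two chunks intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def resolve_overlaps(chunks, priority=PRIORITY):
    rank = {source: i for i, source in enumerate(priority)}
    kept = []
    # Visit chunks from highest to lowest priority and keep a chunk only
    # if it does not overlap one that was already kept.
    for chunk in sorted(chunks, key=lambda c: (rank[c[4]], c[0])):
        if not any(overlaps(chunk, k) for k in kept):
            kept.append(chunk)
    return sorted(kept, key=lambda c: c[0])

chunks = [
    (583, 590, "Thiamine", "DRUG", "posology_chunk"),
    (592, 597, "100 mg", "STRENGTH", "posology_chunk"),
    (583, 590, "Thiamine", "DRUG", "jsl_drug_chunk"),
    (583, 597, "Thiamine 100 mg", "DRUG", "matched_text"),
]
merged = resolve_overlaps(chunks)
for chunk in merged:
    print(chunk)
```

In this toy input the two `posology_chunk` spans win, and the longer `matched_text` span is dropped because it overlaps them.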


**Pipeline Stages**

In [None]:
granular_pipeline = nlp.Pipeline(stages=[
    document_assambler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_jsl,
    jsl_converter_internal,
    ner_posology,
    ner_converter_internal,
    text_matcher,
    chunk_merge
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

granular_model = granular_pipeline.fit(empty_data)


In [None]:
text = """John Smith, a 55-year-old male with a medical history of hypertension, Type 2 Diabetes Mellitus, Hyperlipidemia, Gastroesophageal Reflux Disease (GERD), and chronic constipation, presented with persistent epigastric pain, heartburn, and infrequent bowel movements. He described the epigastric pain as burning and worsening after meals, often accompanied by heartburn and regurgitation, particularly when lying down. Additionally, he reported discomfort and bloating associated with infrequent bowel movements. In response, his doctor prescribed a regimen tailored to his conditions: Thiamine 100 mg q.day , Folic acid 1 mg q.day , multivitamins q.day , Calcium carbonate plus Vitamin D 250 mg t.i.d. , Heparin 5000 units subcutaneously b.i.d. , Prilosec 20 mg q.day , Senna two tabs qhs . The patient was advised to follow a low-fat diet, avoid spicy and acidic foods, and elevate the head of the bed to alleviate GERD symptoms. Lifestyle modifications including regular exercise, smoking cessation, and moderation in alcohol consumption were recommended to manage his chronic conditions effectively. A follow-up appointment in two weeks was scheduled."""


Now we will run the model on our data and get the results.

In [None]:
data = spark.createDataFrame([[text]]).toDF("text")
result = granular_model.transform(data)

After the transform stage, here are the columns in our result dataframe.

In [None]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|             ner_jsl|      jsl_drug_chunk|        ner_posology|      posology_chunk|        matched_text|    medication_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|John Smith, a 55-...|[{document, 0, 11...|[{document, 0, 26...|[{token, 0, 3, Jo...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 583, 590...|[{named_entity, 0...|[{chunk, 583, 590...|[{chunk, 583, 597...|[{chunk, 583, 590...|
+--------------------+--------------------+--------------------+----

You can check all the NER model outputs on the same table as shown below.

In [None]:
# Apply transformations and select necessary columns
granular_df = (result.select(
    F.explode(
        F.arrays_zip(
            result.medication_chunk.result,
            result.medication_chunk.metadata,
            result.jsl_drug_chunk.result,
            result.jsl_drug_chunk.metadata,
            result.posology_chunk.result,
            result.posology_chunk.metadata,
            result.matched_text.result,
            result.matched_text.metadata,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("medication_chunk"),
    F.expr("cols['1']['entity']").alias("ner_label"),
    F.expr("cols['1']['ner_source']").alias("ner_source"),
    F.expr("cols['2']").alias("jsl_drug_chunk"),
    F.expr("cols['3']['entity']").alias("jsl_drug_ner_label"),
    F.expr("cols['4']").alias("posology_chunk"),
    F.expr("cols['5']['entity']").alias("posology_ner_label"),
    F.expr("cols['6']").alias("matched_text"),
    F.expr("cols['7']['entity']").alias("matched_text_ner_label"),
).toPandas())
granular_df.fillna('-', inplace=True)

granular_df

Unnamed: 0,medication_chunk,ner_label,ner_source,jsl_drug_chunk,jsl_drug_ner_label,posology_chunk,posology_ner_label,matched_text,matched_text_ner_label
0,Thiamine,DRUG,posology_chunk,Thiamine,DRUG,Thiamine,DRUG,Thiamine 100 mg,DRUG
1,100 mg,STRENGTH,posology_chunk,100 mg,STRENGTH,100 mg,STRENGTH,Folic acid 1 mg,DRUG
2,q.day,FREQUENCY,posology_chunk,q.day,FREQUENCY,q.day,FREQUENCY,Calcium carbonate,DRUG
3,Folic acid,DRUG,posology_chunk,Folic acid,DRUG,Folic acid,DRUG,Heparin,DRUG
4,1 mg,STRENGTH,posology_chunk,1 mg,STRENGTH,1 mg,STRENGTH,Prilosec 20 mg,DRUG
5,q.day,FREQUENCY,posology_chunk,q.day,FREQUENCY,q.day,FREQUENCY,-,-
6,multivitamins,DRUG,posology_chunk,multivitamins,DRUG,multivitamins,DRUG,-,-
7,q.day,FREQUENCY,posology_chunk,q.day,FREQUENCY,q.day,FREQUENCY,-,-
8,Calcium carbonate,DRUG,posology_chunk,Calcium carbonate,DRUG,Calcium carbonate,DRUG,-,-
9,Vitamin D,DRUG,posology_chunk,Vitamin D,DRUG,Vitamin D,DRUG,-,-
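
When reading this table, keep in mind that `arrays_zip` pairs array elements purely by position and pads shorter arrays with nulls, so values on the same row do not necessarily refer to the same span of text (for example, `Folic acid 1 mg` in the `matched_text` column sits next to `100 mg`). The plain-Python sketch below mimics that positional pairing with `itertools.zip_longest`. The lists are copied from the first rows of the output, and `matched_text` is shortened to two values here so that the padding is visible.

```python
from itertools import zip_longest

# First values of two chunk columns, copied from the table output.
medication_chunk = ["Thiamine", "100 mg", "q.day", "Folic acid"]
matched_text = ["Thiamine 100 mg", "Folic acid 1 mg"]  # shortened for the demo

# zip_longest pairs items by index and fills the shorter list with None,
# analogous to how arrays_zip pads shorter arrays with null.
rows = list(zip_longest(medication_chunk, matched_text))
for row in rows:
    print(row)
```

To compare chunks that cover the same text, it is safer to join on the `begin` and `end` offsets than to rely on row position.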


Let's check the merged chunk output.

In [None]:
medication_df = result.select(F.explode(F.arrays_zip(result.medication_chunk.result,
                                     result.medication_chunk.metadata,
                                     result.medication_chunk.begin,
                                     result.medication_chunk.end
                                    )
                       ).alias("cols")) \
      .select(F.expr("cols['1']['sentence']").alias("sentence_id"),
              F.expr("cols['2']").alias("begin"),
              F.expr("cols['3']").alias("end"),
              F.expr("cols['0']").alias("medication_chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['ner_source']").alias("ner_source")
             ).toPandas()

medication_df

Unnamed: 0,sentence_id,begin,end,medication_chunk,ner_label,ner_source
0,4,583,590,Thiamine,DRUG,posology_chunk
1,4,592,597,100 mg,STRENGTH,posology_chunk
2,4,599,603,q.day,FREQUENCY,posology_chunk
3,4,607,616,Folic acid,DRUG,posology_chunk
4,4,618,621,1 mg,STRENGTH,posology_chunk
5,4,623,627,q.day,FREQUENCY,posology_chunk
6,4,631,643,multivitamins,DRUG,posology_chunk
7,4,645,649,q.day,FREQUENCY,posology_chunk
8,4,653,669,Calcium carbonate,DRUG,posology_chunk
9,4,676,684,Vitamin D,DRUG,posology_chunk
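
As a first step toward finding the most commonly used medications, the `DRUG` rows of the merged output can be counted. The sketch below uses only the standard library and a hand-copied version of the rows shown in this output; with the real data, a list in this shape could be obtained with `medication_df.to_dict("records")`. Lower-casing the chunk text before counting is an assumption made here so that spelling variants such as `Metformin` and `metformin` would be counted together.

```python
from collections import Counter

# Rows of the merged output, reduced to the two columns needed for counting.
rows = [
    {"medication_chunk": "Thiamine", "ner_label": "DRUG"},
    {"medication_chunk": "100 mg", "ner_label": "STRENGTH"},
    {"medication_chunk": "q.day", "ner_label": "FREQUENCY"},
    {"medication_chunk": "Folic acid", "ner_label": "DRUG"},
    {"medication_chunk": "1 mg", "ner_label": "STRENGTH"},
    {"medication_chunk": "q.day", "ner_label": "FREQUENCY"},
    {"medication_chunk": "multivitamins", "ner_label": "DRUG"},
    {"medication_chunk": "q.day", "ner_label": "FREQUENCY"},
    {"medication_chunk": "Calcium carbonate", "ner_label": "DRUG"},
    {"medication_chunk": "Vitamin D", "ner_label": "DRUG"},
]

# Count only DRUG chunks, ignoring case differences (an assumption).
drug_counts = Counter(
    row["medication_chunk"].lower() for row in rows if row["ner_label"] == "DRUG"
)
print(drug_counts.most_common(3))
```

With a single note every drug appears once; the counts become informative once the same logic is applied across the whole dataset.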


Before running the model on the whole dataset, we can use the Spark NLP `LightPipeline` to check the results faster on a small sample, and then update the pipeline if needed.

We will visualize the results with the `sparknlp_display` library so that we can check them directly on the document.

In [None]:
from sparknlp_display import NerVisualizer

light_model = nlp.LightPipeline(granular_model)

light_result = light_model.fullAnnotate(text)

visualiser = NerVisualizer()

visualiser.display(light_result[0], label_col='medication_chunk', document_col='document', save_path="display_bert_result.html")
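
`fullAnnotate` returns one dictionary per input text, mapping each output column to a list of annotation objects. The helper below is a small sketch for turning one of those lists into plain tuples; it relies only on the `begin`, `end`, `result`, and `metadata` attributes of each annotation. The `chunks_to_rows` name and the `FakeAnnotation` stand-in (used so that the snippet runs without a Spark session) are hypothetical.

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation so the sketch runs without Spark.
FakeAnnotation = namedtuple("FakeAnnotation", ["begin", "end", "result", "metadata"])

def chunks_to_rows(annotations):
    """Return (begin, end, text, entity) tuples for a list of chunk annotations."""
    return [(a.begin, a.end, a.result, a.metadata.get("entity")) for a in annotations]

sample = [
    FakeAnnotation(583, 590, "Thiamine", {"entity": "DRUG", "sentence": "4"}),
    FakeAnnotation(592, 597, "100 mg", {"entity": "STRENGTH", "sentence": "4"}),
]
rows = chunks_to_rows(sample)
print(rows)
```

With the objects created in the cell above, the call would be `chunks_to_rows(light_result[0]["medication_chunk"])`.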

We've already created a pretrained pipeline, [ner_medication_pipeline](https://nlp.johnsnowlabs.com/2024/03/22/ner_medication_pipeline_en.html), integrating all these components. You can use this pretrained pipeline with a single line of code.
Please refer to the example located at the end of this notebook to see how pretrained pipelines are easy to use.

## Generic Medication NER Models

When a `DRUG` entity is immediately followed by related entities such as strength or route, the "greedy" models return the whole span as a single chunk labeled `DRUG`. The example below shows this.

**NER Models**

In [7]:
ner_jsl_greedy = medical.NerModel.pretrained("ner_jsl_greedy","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_jsl_greedy")\
    .setLabelCasing("upper")

ner_jsl_greedy_converter_internal = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_jsl_greedy"])\
    .setOutputCol("jsl_greedy_drug_chunk")\
    .setWhiteList(['DRUG_INGREDIENT', 'DRUG_BRANDNAME', 'DRUG'])\
    .setReplaceLabels({"DRUG_INGREDIENT" : "DRUG", "DRUG_BRANDNAME" : "DRUG"})


ner_posology_greedy = medical.NerModel.pretrained("ner_posology_greedy","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_posology_greedy")

posology_converter_internal = medical.NerConverterInternal()\
    .setInputCols(["sentence","token", "ner_posology_greedy"])\
    .setOutputCol("posology_greedy_chunk")\
    .setWhiteList(["DRUG"])

drugs_large = medical.NerModel.pretrained("ner_drugs_large", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("drugs_large")

drugs_large_converter = medical.NerConverter() \
    .setInputCols(["sentence", "token", "drugs_large"]) \
    .setOutputCol("drugs_large_chunk")

text_matcher = medical.TextMatcherModel.pretrained("drug_matcher","en","clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")

chunk_merge_greedy = medical.ChunkMergeApproach()\
    .setInputCols("jsl_greedy_drug_chunk", "posology_greedy_chunk", "drugs_large_chunk", "matched_text")\
    .setOutputCol("medication_greedy_chunk")\
    .setOrderingFeatures(["ChunkPrecedence"])\
    .setChunkPrecedence('ner_source,entity')\
    .setChunkPrecedenceValuePrioritization(["posology_greedy_chunk,DRUG","jsl_greedy_drug_chunk,DRUG", "drugs_large_chunk,DRUG","matched_text,DRUG"])\
    .setCaseSensitive(False)


ner_jsl_greedy download started this may take some time.
[OK!]
ner_posology_greedy download started this may take some time.
[OK!]
ner_drugs_large download started this may take some time.
[OK!]
drug_matcher download started this may take some time.
[OK!]


We've already created a pretrained pipeline, [ner_medication_generic_pipeline](https://nlp.johnsnowlabs.com/2021/03/31/ner_medication_generic_pipeline.html), integrating all these components. You can use this pretrained pipeline with a single line of code. Please refer to the example located at the end of this notebook.

**Pipeline Stages**

In [8]:
generic_pipeline = nlp.Pipeline(stages=[
    document_assambler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_jsl_greedy,
    ner_jsl_greedy_converter_internal,
    ner_posology_greedy,
    posology_converter_internal,
    drugs_large,
    drugs_large_converter,
    text_matcher,
    chunk_merge_greedy
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

generic_model = generic_pipeline.fit(empty_data)

In [None]:
text = """The patient described the epigastric pain as burning and worsening after meals, often accompanied by heartburn and regurgitation, particularly when lying down.
Additionally, he reported discomfort and bloating associated with infrequent bowel movements. In response, his doctor prescribed a regimen tailored to his conditions:
Thiamine 100 mg , Folic acid 1 mg , multivitamins , Calcium carbonate plus Vitamin D 250 mg , Heparin 5000 units subcutaneously , Prilosec 20 mg , Senna two tabs ."""

Let's run the model on our data and get the results.

In [None]:
df = spark.createDataFrame([[text]]).toDF("text")
result = generic_model.transform(df)

After the transform stage, here are the columns in our result dataframe.

In [None]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+--------------------+---------------------+--------------------+--------------------+--------------------+-----------------------+
|                text|            document|            sentence|               token|          embeddings|      ner_jsl_greedy|jsl_greedy_drug_chunk| ner_posology_greedy|posology_greedy_chunk|         drugs_large|   drugs_large_chunk|        matched_text|medication_greedy_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+--------------------+---------------------+--------------------+--------------------+--------------------+-----------------------+
|The patient descr...|[{document, 0, 48...|[{document, 0, 15...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...| [{chunk, 327, 341...|[{named_ent

Checking the greedy model outputs on the same table.

In [None]:
# Apply transformations and select necessary columns
generic_df = (result.select(
    F.explode(
        F.arrays_zip(
            result.medication_greedy_chunk.result,
            result.medication_greedy_chunk.metadata,
            result.jsl_greedy_drug_chunk.result,
            result.jsl_greedy_drug_chunk.metadata,
            result.posology_greedy_chunk.result,
            result.posology_greedy_chunk.metadata,
            result.drugs_large_chunk.result,
            result.drugs_large_chunk.metadata,
            result.matched_text.result,
            result.matched_text.metadata,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("medication_greedy_chunk"),
    F.expr("cols['1']['entity']").alias("ner_label"),
    F.expr("cols['1']['ner_source']").alias("ner_source"),
    F.expr("cols['2']").alias("ner_jsl_greedy_chunk"),
    F.expr("cols['3']['entity']").alias("ner_jsl_greedy_label"),
    F.expr("cols['4']").alias("posology_greedy_chunk"),
    F.expr("cols['5']['entity']").alias("posology_greedy_label"),
    F.expr("cols['6']").alias("drugs_large_chunk"),
    F.expr("cols['7']['entity']").alias("drugs_large_label"),
    F.expr("cols['8']").alias("matched_text"),
    F.expr("cols['9']['entity']").alias("matched_text_ner_label"),
).toPandas())

generic_df.fillna('-', inplace=True)

generic_df

Unnamed: 0,medication_greedy_chunk,ner_label,ner_source,ner_jsl_greedy_chunk,ner_jsl_greedy_label,posology_greedy_chunk,posology_greedy_label,drugs_large_chunk,drugs_large_label,matched_text,matched_text_ner_label
0,Thiamine 100 mg,DRUG,posology_greedy_chunk,Thiamine 100 mg,DRUG,Thiamine 100 mg,DRUG,Thiamine 100 mg,DRUG,Thiamine 100 mg,DRUG
1,Folic acid 1 mg,DRUG,posology_greedy_chunk,Folic acid 1 mg,DRUG,Folic acid 1 mg,DRUG,Folic acid 1 mg,DRUG,Folic acid 1 mg,DRUG
2,multivitamins,DRUG,posology_greedy_chunk,multivitamins,DRUG,multivitamins,DRUG,multivitamins,DRUG,Calcium carbonate,DRUG
3,Calcium carbonate,DRUG,posology_greedy_chunk,Calcium carbonate,DRUG,Calcium carbonate,DRUG,Calcium carbonate,DRUG,Heparin,DRUG
4,Vitamin D 250 mg,DRUG,posology_greedy_chunk,Vitamin D 250 mg,DRUG,Vitamin D 250 mg,DRUG,Vitamin D 250 mg,DRUG,Prilosec 20 mg,DRUG
5,Heparin 5000 units subcutaneously,DRUG,posology_greedy_chunk,Heparin 5000 units subcutaneously,DRUG,Heparin 5000 units subcutaneously,DRUG,Heparin 5000 units subcutaneously,DRUG,-,-
6,Prilosec 20 mg,DRUG,posology_greedy_chunk,Prilosec 20 mg,DRUG,Prilosec 20 mg,DRUG,Prilosec 20 mg,DRUG,-,-
7,Senna two tabs,DRUG,posology_greedy_chunk,Senna two tabs,DRUG,Senna two tabs,DRUG,Senna two tabs .,DRUG,-,-


Merged chunk output is shown below.

In [None]:
medication_greedy_df = result.select(F.explode(F.arrays_zip(result.medication_greedy_chunk.result,
                                     result.medication_greedy_chunk.metadata,
                                     result.medication_greedy_chunk.begin,
                                     result.medication_greedy_chunk.end
                                    )
                       ).alias("cols")) \
      .select(F.expr("cols['1']['sentence']").alias("sentence_id"),
              F.expr("cols['2']").alias("begin"),
              F.expr("cols['3']").alias("end"),
              F.expr("cols['0']").alias("medication_greedy_chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['ner_source']").alias("ner_source")
             ).toPandas()
medication_greedy_df

Unnamed: 0,sentence_id,begin,end,medication_greedy_chunk,ner_label,ner_source
0,3,327,341,Thiamine 100 mg,DRUG,posology_greedy_chunk
1,3,345,359,Folic acid 1 mg,DRUG,posology_greedy_chunk
2,3,363,375,multivitamins,DRUG,posology_greedy_chunk
3,3,379,395,Calcium carbonate,DRUG,posology_greedy_chunk
4,3,402,417,Vitamin D 250 mg,DRUG,posology_greedy_chunk
5,3,421,453,Heparin 5000 units subcutaneously,DRUG,posology_greedy_chunk
6,3,457,470,Prilosec 20 mg,DRUG,posology_greedy_chunk
7,3,474,487,Senna two tabs,DRUG,posology_greedy_chunk


Now, we will visualize the greedy merged chunk.

In [None]:
greedy_light_model = nlp.LightPipeline(generic_model)

greedy_light_result = greedy_light_model.fullAnnotate(text)

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(greedy_light_result[0], label_col='medication_greedy_chunk', document_col='document', save_path="display_bert_result.html")

## Comparison of Medication and Medication-Greedy Results

This section illustrates the difference between the standard pretrained models and the greedy models. The greedy models chunk DRUG, DOSAGE, ROUTE, and STRENGTH entities into a single, larger DRUG entity when they appear together.
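
The effect can be approximated in a few lines of plain Python. This is only an illustration of what the output looks like, not of how the greedy models work internally (they are NER models trained to predict the longer span directly). The `merge_adjacent` helper, the set of absorbed labels, the one-character gap threshold, and the hand-assigned labels in the sample are all assumptions chosen for the example.

```python
# Labels that get absorbed into the preceding DRUG chunk in this sketch.
ABSORBED = {"DOSAGE", "STRENGTH", "ROUTE", "FORM"}

def merge_adjacent(chunks, text, max_gap=1):
    """Merge a DRUG chunk with the ABSORBED chunks that directly follow it.

    chunks: list of (begin, end, label) with inclusive `end`, sorted by begin.
    """
    merged = []
    for begin, end, label in chunks:
        follows_drug = merged and merged[-1][2] == "DRUG"
        if follows_drug and label in ABSORBED and begin - merged[-1][1] - 1 <= max_gap:
            # Extend the previous DRUG span to cover this chunk as well.
            merged[-1] = (merged[-1][0], end, "DRUG")
        else:
            merged.append((begin, end, label))
    return [(text[b:e + 1], label) for b, e, label in merged]

text = "Heparin 5000 units subcutaneously b.i.d."
granular = [(0, 6, "DRUG"), (8, 17, "DOSAGE"), (19, 32, "ROUTE"), (34, 39, "FREQUENCY")]
print(merge_adjacent(granular, text))
```

The drug, dosage, and route collapse into one `DRUG` span, while the frequency stays separate because it is not in the absorbed set.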

In [None]:
from google.colab import widgets
from sparknlp_display import NerVisualizer
visualiser = NerVisualizer()

t = widgets.TabBar(["medication", "medication_greedy", "viz_medication", "viz_medication_greedy"])

with t.output_to(0):
    display(medication_df)

with t.output_to(1):
    display(medication_greedy_df)

with t.output_to(2):
    visualiser.display(light_result[0], label_col='medication_chunk', document_col='document')

with t.output_to(3):
    visualiser.display(greedy_light_result[0], label_col='medication_greedy_chunk', document_col='document')

Unnamed: 0,sentence_id,begin,end,medication_chunk,ner_label,ner_source
0,4,583,590,Thiamine,DRUG,posology_chunk
1,4,592,597,100 mg,STRENGTH,posology_chunk
2,4,599,603,q.day,FREQUENCY,posology_chunk
3,4,607,616,Folic acid,DRUG,posology_chunk
4,4,618,621,1 mg,STRENGTH,posology_chunk
5,4,623,627,q.day,FREQUENCY,posology_chunk
6,4,631,643,multivitamins,DRUG,posology_chunk
7,4,645,649,q.day,FREQUENCY,posology_chunk
8,4,653,669,Calcium carbonate,DRUG,posology_chunk
9,4,676,684,Vitamin D,DRUG,posology_chunk


Unnamed: 0,sentence_id,begin,end,medication_greedy_chunk,ner_label,ner_source
0,3,327,341,Thiamine 100 mg,DRUG,posology_greedy_chunk
1,3,345,359,Folic acid 1 mg,DRUG,posology_greedy_chunk
2,3,363,375,multivitamins,DRUG,posology_greedy_chunk
3,3,379,395,Calcium carbonate,DRUG,posology_greedy_chunk
4,3,402,417,Vitamin D 250 mg,DRUG,posology_greedy_chunk
5,3,421,453,Heparin 5000 units subcutaneously,DRUG,posology_greedy_chunk
6,3,457,470,Prilosec 20 mg,DRUG,posology_greedy_chunk
7,3,474,487,Senna two tabs,DRUG,posology_greedy_chunk


# Relation Extraction

This section demonstrates how to extract medication-related relations using the pretrained [posology_re](https://nlp.johnsnowlabs.com/2020/09/01/posology_re.html) relation extraction model.

The following relations are supported:

- DRUG-DOSAGE
- DRUG-FREQUENCY
- DRUG-ADE (Adverse Drug Events)
- DRUG-FORM
- DRUG-ROUTE
- DRUG-DURATION
- DRUG-REASON
- DRUG-STRENGTH

The `posology_re` has been validated against the posology dataset described in (Magge, Scotch, & Gonzalez-Hernandez, 2018).

| Relation | Recall | Precision | F1 | F1 (Magge, Scotch, & Gonzalez-Hernandez, 2018) |
| --- | --- | --- | --- | --- |
| DRUG-ADE | 0.66 | 1.00 | **0.80** | 0.76 |
| DRUG-DOSAGE | 0.89 | 1.00 | **0.94** | 0.91 |
| DRUG-DURATION | 0.75 | 1.00 | **0.85** | 0.92 |
| DRUG-FORM | 0.88 | 1.00 | **0.94** | 0.95* |
| DRUG-FREQUENCY | 0.79 | 1.00 | **0.88** | 0.90 |
| DRUG-REASON | 0.60 | 1.00 | **0.75** | 0.70 |
| DRUG-ROUTE | 0.79 | 1.00 | **0.88** | 0.95* |
| DRUG-STRENGTH | 0.95 | 1.00 | **0.98** | 0.97 |


*Magge, Scotch, Gonzalez-Hernandez (2018) collapsed DRUG-FORM and DRUG-ROUTE into a single relation.
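
The F1 column is the harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`. As a quick check, the snippet below recomputes it from the precision and recall values in the table. Because the published recall values are rounded to two decimals, the recomputed scores can differ from the table by about 0.01.

```python
# (recall, precision, reported F1) copied from the table.
reported = {
    "DRUG-ADE": (0.66, 1.00, 0.80),
    "DRUG-DOSAGE": (0.89, 1.00, 0.94),
    "DRUG-DURATION": (0.75, 1.00, 0.85),
    "DRUG-FORM": (0.88, 1.00, 0.94),
    "DRUG-FREQUENCY": (0.79, 1.00, 0.88),
    "DRUG-REASON": (0.60, 1.00, 0.75),
    "DRUG-ROUTE": (0.79, 1.00, 0.88),
    "DRUG-STRENGTH": (0.95, 1.00, 0.98),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for relation, (recall, precision, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{relation:15s} recomputed={recomputed:.3f} reported={f1:.2f}")
```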


You can check [Clinical Relation Extraction Notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb) for more details.

### Relation Extraction Model

In this part, the relation extraction model optimized for posology is used to get the relations between entities.

The precision of the RE model is controlled by `setMaxSyntacticDistance(4)`, which sets the maximum syntactic distance between named entities to 4. A larger value improves recall at the expense of precision. A value of 4 yields perfect precision (i.e., the model produces no false positives) with reasonably good recall.

You can also choose which relations appear in the output with `setRelationPairs()`.

In [None]:
# RE
pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

reModel = medical.RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "medication_chunk", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)
# Optionally restrict the output to specific relation pairs:
# reModel.setRelationPairs(["DRUG-DURATION", "DRUG-FREQUENCY", "DRUG-STRENGTH"])


pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]


**Pipeline Stages**

In [None]:
re_pipeline = nlp.Pipeline(stages=[
    document_assambler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_jsl,
    jsl_converter_internal,
    ner_posology,
    ner_converter_internal,
    text_matcher,
    chunk_merge,
    pos_tagger,
    dependency_parser,
    reModel
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

re_model = re_pipeline.fit(empty_data)

In [None]:
text = """
The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also given 1 unit of Metformin daily.
She was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night ,
12 units of insulin lispro with meals , and metformin 1000 mg two times a day.
"""

Let's check the results using the `.transform()` method:

In [None]:
data = spark.createDataFrame([[text]]).toDF("text")

result = re_model.transform(data)
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|             ner_jsl|      jsl_drug_chunk|        ner_posology|      posology_chunk|        matched_text|    medication_chunk|            pos_tags|        dependencies|           relations|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|\nThe patient was...|[{document, 0, 30...|[{document, 1, 66...|[{token, 1, 3, Th...|[{word_embeddings...|[{name

Let's see the results with `explode()` function:

In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.relations.result,
                                                 result.relations.metadata)).alias("cols")) \
                  .select(
                          F.expr("cols['1']['sentence']").alias("sentence"),\
                          F.expr("cols['1']['entity1_begin']").alias("entity1_begin"),\
                          F.expr("cols['1']['entity1_end']").alias("entity1_end"),\
                          F.expr("cols['1']['chunk1']" ).alias("chunk1" ),\
                          F.expr("cols['1']['entity1']").alias("entity1"),\
                          F.expr("cols['1']['entity2_begin']").alias("entity2_begin"),\
                          F.expr("cols['1']['entity2_end']").alias("entity2_end"),\
                          F.expr("cols['1']['chunk2']" ).alias("chunk2"),\
                          F.expr("cols['1']['entity2']").alias("entity2"),\
                          F.expr("cols['0']").alias("relation"),\
                          F.expr("cols['1']['confidence']").alias("confidence"),\
                          )

result_df.show()

+--------+-------------+-----------+----------------+-------+-------------+-----------+----------------+---------+--------------+----------+
|sentence|entity1_begin|entity1_end|          chunk1|entity1|entity2_begin|entity2_end|          chunk2|  entity2|      relation|confidence|
+--------+-------------+-----------+----------------+-------+-------------+-----------+----------------+---------+--------------+----------+
|       0|           28|         33|          1 unit| DOSAGE|           38|         42|           Advil|     DRUG|   DOSAGE-DRUG|       1.0|
|       0|           38|         42|           Advil|   DRUG|           44|         53|      for 5 days| DURATION| DRUG-DURATION|       1.0|
|       1|           95|        100|          1 unit| DOSAGE|          105|        113|       Metformin|     DRUG|   DOSAGE-DRUG|       1.0|
|       1|          105|        113|       Metformin|   DRUG|          115|        119|           daily|FREQUENCY|DRUG-FREQUENCY|       1.0|
|       2|   
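
As a rough mental model of the cell above, here is a plain-Python analogy (not Spark code): `arrays_zip` pairs the i-th relation label with the i-th metadata map, and `explode` emits one row per pair, which is why `cols['0']` holds the relation and `cols['1']` holds its metadata. The sample values are copied from the first two relations in the output.

```python
# Plain-Python analogy for F.explode(F.arrays_zip(results, metadata)).
# Sample values are taken from the first two relations shown above.
relation_results = ["DOSAGE-DRUG", "DRUG-DURATION"]
relation_metadata = [
    {"chunk1": "1 unit", "entity1": "DOSAGE", "chunk2": "Advil", "entity2": "DRUG"},
    {"chunk1": "Advil", "entity1": "DRUG", "chunk2": "for 5 days", "entity2": "DURATION"},
]

# arrays_zip: pair up the elements that share an index.
zipped = list(zip(relation_results, relation_metadata))

# explode + select: one output row per zipped pair.
rows = [
    {"relation": rel, "chunk1": meta["chunk1"], "chunk2": meta["chunk2"]}
    for rel, meta in zipped
]

for row in rows:
    print(row)
```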

We can also check the results with the `LightPipeline`:

In [None]:
lmodel = nlp.LightPipeline(re_model)

results = lmodel.fullAnnotate(text)

In [None]:
results[0].keys()

dict_keys(['medication_chunk', 'document', 'matched_text', 'token', 'relations', 'jsl_drug_chunk', 'posology_chunk', 'embeddings', 'pos_tags', 'ner_jsl', 'dependencies', 'ner_posology', 'sentence'])

In [None]:
results[0]['relations']

[Annotation(category, 28, 42, DOSAGE-DRUG, {'chunk2': 'Advil', 'confidence': '1.0', 'entity2_end': '42', 'chunk1': '1 unit', 'entity1': 'DOSAGE', 'entity2_begin': '38', 'chunk2_confidence': '0.9984', 'entity1_begin': '28', 'sentence': '0', 'direction': 'both', 'entity1_end': '33', 'entity2': 'DRUG', 'chunk1_confidence': '0.71675'}, []),
 Annotation(category, 38, 53, DRUG-DURATION, {'chunk2': 'for 5 days', 'confidence': '1.0', 'entity2_end': '53', 'chunk1': 'Advil', 'entity1': 'DRUG', 'entity2_begin': '44', 'chunk2_confidence': '0.7455', 'entity1_begin': '38', 'sentence': '0', 'direction': 'both', 'entity1_end': '42', 'entity2': 'DURATION', 'chunk1_confidence': '0.9984'}, []),
 Annotation(category, 95, 113, DOSAGE-DRUG, {'chunk2': 'Metformin', 'confidence': '1.0', 'entity2_end': '113', 'chunk1': '1 unit', 'entity1': 'DOSAGE', 'entity2_begin': '105', 'chunk2_confidence': '0.9998', 'entity1_begin': '95', 'sentence': '1', 'direction': 'both', 'entity1_end': '100', 'entity2': 'DRUG', 'chunk

We can see extracted relations:

In [None]:
for rel in results[0]["relations"]:
    print("{}({}={} - {}={})".format(
        rel.result,
        rel.metadata['entity1'],
        rel.metadata['chunk1'],
        rel.metadata['entity2'],
        rel.metadata['chunk2']
    ))

DOSAGE-DRUG(DOSAGE=1 unit - DRUG=Advil)
DRUG-DURATION(DRUG=Advil - DURATION=for 5 days)
DOSAGE-DRUG(DOSAGE=1 unit - DRUG=Metformin)
DRUG-FREQUENCY(DRUG=Metformin - FREQUENCY=daily)
DOSAGE-DRUG(DOSAGE=40 units - DRUG=insulin glargine)
DRUG-FREQUENCY(DRUG=insulin glargine - FREQUENCY=at night)
DOSAGE-DRUG(DOSAGE=12 units - DRUG=insulin lispro)
DRUG-FREQUENCY(DRUG=insulin lispro - FREQUENCY=with meals)
DRUG-STRENGTH(DRUG=metformin - STRENGTH=1000 mg)
DRUG-FREQUENCY(DRUG=metformin - FREQUENCY=two times a day)
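
To answer "what are the dosage, frequency, and strength details for each medication", the flat relation list can be regrouped per drug. The sketch below is one possible way to do that in plain Python; the tuples are copied from a subset of the output above, and `group_by_drug` is a hypothetical helper, not part of the library. Keying on the chunk text is a simplification: two mentions with the same surface form are merged, while `Metformin` and `metformin` would stay separate.

```python
from collections import defaultdict

# (relation, entity1, chunk1, entity2, chunk2), copied from the output above.
relations = [
    ("DOSAGE-DRUG", "DOSAGE", "1 unit", "DRUG", "Advil"),
    ("DRUG-DURATION", "DRUG", "Advil", "DURATION", "for 5 days"),
    ("DOSAGE-DRUG", "DOSAGE", "40 units", "DRUG", "insulin glargine"),
    ("DRUG-FREQUENCY", "DRUG", "insulin glargine", "FREQUENCY", "at night"),
    ("DRUG-STRENGTH", "DRUG", "metformin", "STRENGTH", "1000 mg"),
    ("DRUG-FREQUENCY", "DRUG", "metformin", "FREQUENCY", "two times a day"),
]

def group_by_drug(rels):
    """Collect every non-DRUG attribute under the drug it is related to."""
    summary = defaultdict(dict)
    for _, entity1, chunk1, entity2, chunk2 in rels:
        # The DRUG can sit on either side of the relation.
        if entity1 == "DRUG":
            drug, label, value = chunk1, entity2, chunk2
        elif entity2 == "DRUG":
            drug, label, value = chunk2, entity1, chunk1
        else:
            continue  # skip relations that do not involve a DRUG
        summary[drug].setdefault(label, []).append(value)
    return dict(summary)

drug_summary = group_by_drug(relations)
for drug, attributes in drug_summary.items():
    print(drug, attributes)
```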


We can create a pandas DataFrame to view the results in an easy-to-read table:

In [None]:
rel_data = [
    {
        'sentence': rel.metadata['sentence'],
        'entity1_begin': rel.metadata['entity1_begin'],
        'entity1_end': rel.metadata['entity1_end'],
        'chunk1': rel.metadata['chunk1'],
        'entity1': rel.metadata['entity1'],
        'entity2_begin': rel.metadata['entity2_begin'],
        'entity2_end': rel.metadata['entity2_end'],
        'chunk2': rel.metadata['chunk2'],
        'entity2': rel.metadata['entity2'],
        'relation': rel.result,
        'confidence': rel.metadata['confidence']
    }
    for rel in results[0]['relations']
]
rel_df = pd.DataFrame(rel_data)

chunk_data = [
    {
        'begin': str(chunk.begin),
        'end': str(chunk.end),
        'chunk': chunk.result
    }
    for chunk in results[0]['medication_chunk']
]
chunks_df = pd.DataFrame(chunk_data)

result_df = pd.merge(
    rel_df, chunks_df,
    left_on=["entity1_begin", "entity1_end", "chunk1"],
    right_on=["begin", "end", "chunk"]
)[rel_df.columns]

result_df

Unnamed: 0,sentence,entity1_begin,entity1_end,chunk1,entity1,entity2_begin,entity2_end,chunk2,entity2,relation,confidence
0,0,28,33,1 unit,DOSAGE,38,42,Advil,DRUG,DOSAGE-DRUG,1.0
1,0,38,42,Advil,DRUG,44,53,for 5 days,DURATION,DRUG-DURATION,1.0
2,1,95,100,1 unit,DOSAGE,105,113,Metformin,DRUG,DOSAGE-DRUG,1.0
3,1,105,113,Metformin,DRUG,115,119,daily,FREQUENCY,DRUG-FREQUENCY,1.0
4,2,190,197,40 units,DOSAGE,202,217,insulin glargine,DRUG,DOSAGE-DRUG,1.0
5,2,202,217,insulin glargine,DRUG,219,226,at night,FREQUENCY,DRUG-FREQUENCY,1.0
6,2,230,237,12 units,DOSAGE,242,255,insulin lispro,DRUG,DOSAGE-DRUG,1.0
7,2,242,255,insulin lispro,DRUG,257,266,with meals,FREQUENCY,DRUG-FREQUENCY,1.0
8,2,274,282,metformin,DRUG,284,290,1000 mg,STRENGTH,DRUG-STRENGTH,1.0
9,2,274,282,metformin,DRUG,292,306,two times a day,FREQUENCY,DRUG-FREQUENCY,1.0
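
One detail in the cell above is easy to miss: relation metadata stores offsets as strings (`'entity1_begin': '28'`), while `chunk.begin` and `chunk.end` are integers, which is why the chunk offsets are wrapped in `str(...)` before merging. The plain-Python sketch below illustrates the same inner join and what the key mismatch does; `inner_join` is a hypothetical helper written only for this illustration.

```python
# Offsets as they appear in relation metadata (strings) ...
relation_rows = [
    {"entity1_begin": "28", "entity1_end": "33", "chunk1": "1 unit"},
    {"entity1_begin": "38", "entity1_end": "42", "chunk1": "Advil"},
]
# ... and as they appear on chunk annotations (integers).
chunk_rows = [
    {"begin": 28, "end": 33, "chunk": "1 unit"},
    {"begin": 38, "end": 42, "chunk": "Advil"},
]

def inner_join(relations, chunks, cast=lambda v: v):
    """Keep relations whose (begin, end, text) key matches a chunk."""
    chunk_keys = {(cast(c["begin"]), cast(c["end"]), c["chunk"]) for c in chunks}
    return [
        r for r in relations
        if (r["entity1_begin"], r["entity1_end"], r["chunk1"]) in chunk_keys
    ]

without_cast = inner_join(relation_rows, chunk_rows)         # "28" never equals 28
with_cast = inner_join(relation_rows, chunk_rows, cast=str)  # "28" equals "28"
print(len(without_cast), len(with_cast))
```

In this plain-Python version the mismatch silently yields no matches, so casting both sides to the same type before joining is the safer habit.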


## Relation Extraction Visualization

You can use the `sparknlp_display` library to inspect the relations visually on the document.

In [None]:
from sparknlp_display import RelationExtractionVisualizer

vis = RelationExtractionVisualizer()
vis.display(results[0], 'relations', show_relations=True) # default show_relations: True


# Assertion

This section shows how an assertion status is assigned to each extracted clinical entity based on its context in the text.

You can check [Clinical Assertion Notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb) for more details.

### Assertion Model

In [None]:
chunk_merge_assertion = medical.ChunkMergeApproach()\
    .setInputCols("medication_greedy_chunk")\
    .setOutputCol("assertion_chunk")\
    .setWhiteList(["DRUG"]) # List of NER labels for assertion

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "assertion_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","medication_greedy_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)

assertion_dl download started this may take some time.
[OK!]


### Pipeline Stages

In [None]:
medication_assertion_pipeline = nlp.Pipeline(stages=[
    document_assambler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_jsl_greedy,
    ner_jsl_greedy_converter_internal,
    ner_posology_greedy,
    posology_converter_internal,
    drugs_large,
    drugs_large_converter,
    text_matcher,
    chunk_merge_greedy,
    chunk_merge_assertion,
    clinical_assertion,
    assertion_filterer
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

medication_assertion_model = medication_assertion_pipeline.fit(empty_data)


In [None]:
text = """The patient was prescribed Advil in case of feeling pain. The patient was also given 1 unit of Metformin daily.She was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day . """

Let's transform our data on the model and get the results.

In [None]:
data = spark.createDataFrame([[text]]).toDF("text")

result_assertion = medication_assertion_model.transform(data)

After the transform stage, here are the columns in our result dataframe.

In [None]:
result_assertion.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+--------------------+---------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|      ner_jsl_greedy|jsl_greedy_drug_chunk| ner_posology_greedy|posology_greedy_chunk|         drugs_large|   drugs_large_chunk|        matched_text|medication_greedy_chunk|     assertion_chunk|           assertion|  assertion_filtered|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+--------------------+---------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------

Let's see the results with the `explode()` function:

In [None]:
result_assertion_df = result_assertion.select(F.explode(F.arrays_zip(result_assertion.medication_greedy_chunk.result,
                                                                     result_assertion.medication_greedy_chunk.begin,
                                                                     result_assertion.medication_greedy_chunk.end,
                                                                     result_assertion.medication_greedy_chunk.metadata,
                                                                     result_assertion.assertion.result)).alias("cols")) \
                  .select(
                          F.expr("cols['3']['sentence']").alias("sentence_id"),\
                          F.expr("cols['0']").alias("chunks"),\
                          F.expr("cols['1']").alias("begin"),\
                          F.expr("cols['2']").alias("end"),\
                          F.expr("cols['3']['entity']").alias("entities"),\
                          F.expr("cols['4']").alias("assertion"),\
                          F.expr("cols['3']['confidence']").alias("confidence"),\
                          )

result_assertion_df.show(truncate=False)

+-----------+----------------------------+-----+---+--------+-----------+----------+
|sentence_id|chunks                      |begin|end|entities|assertion  |confidence|
+-----------+----------------------------+-----+---+--------+-----------+----------+
|0          |Advil                       |27   |31 |DRUG    |conditional|0.9135    |
|1          |1 unit of Metformin         |85   |103|DRUG    |present    |0.69277495|
|2          |40 units of insulin glargine|179  |206|DRUG    |present    |0.62600005|
|2          |12 units of insulin lispro  |219  |244|DRUG    |present    |0.6605    |
|2          |metformin 1000 mg           |263  |279|DRUG    |present    |0.7073    |
+-----------+----------------------------+-----+---+--------+-----------+----------+



We can also check the results with the `LightPipeline`:

In [None]:
lmodel = nlp.LightPipeline(medication_assertion_model)

light_result = lmodel.fullAnnotate(text)

In [None]:
light_result[0].keys()

dict_keys(['assertion_filtered', 'ner_posology_greedy', 'ner_jsl_greedy', 'document', 'matched_text', 'assertion', 'drugs_large', 'jsl_greedy_drug_chunk', 'drugs_large_chunk', 'posology_greedy_chunk', 'token', 'medication_greedy_chunk', 'embeddings', 'assertion_chunk', 'sentence'])

In [None]:
ner_chunk = []
ner_label = []
begin = []
end = []
assertion = []
confidence=[]

for n,m in zip(light_result[0]['assertion_chunk'],light_result[0]['assertion']):

    ner_chunk.append(n.result)
    begin.append(n.begin)
    end.append(n.end)
    ner_label.append(n.metadata['entity'])
    assertion.append(m.result)
    confidence.append(m.metadata['confidence'])



import pandas as pd

df = pd.DataFrame({'ner_chunk':ner_chunk, 'begin': begin, 'end':end,
                   'ner_label':ner_label, 'confidence':confidence, 'assertion':assertion })

df

Unnamed: 0,ner_chunk,begin,end,ner_label,confidence,assertion
0,Advil,27,31,DRUG,0.6474,conditional
1,1 unit of Metformin,85,103,DRUG,0.9986,present
2,40 units of insulin glargine,179,206,DRUG,1.0,present
3,12 units of insulin lispro,219,244,DRUG,0.9999,present
4,metformin 1000 mg,263,279,DRUG,0.9998,present
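
The assertion column is what lets us separate medications the patient is actually taking from those mentioned conditionally. The sketch below is a minimal illustration in plain Python, using the rows from the table above. Treating `present` as "currently in use" is an assumption made here for illustration, and `split_by_status` is a hypothetical helper.

```python
# (chunk, assertion) pairs copied from the table above.
assertion_rows = [
    ("Advil", "conditional"),
    ("1 unit of Metformin", "present"),
    ("40 units of insulin glargine", "present"),
    ("12 units of insulin lispro", "present"),
    ("metformin 1000 mg", "present"),
]

def split_by_status(rows, current_statuses=("present",)):
    """Separate chunks whose status counts as 'current' from all others."""
    current, other = [], []
    for chunk, status in rows:
        (current if status in current_statuses else other).append((chunk, status))
    return current, other

current, other = split_by_status(assertion_rows)
print("current:", [chunk for chunk, _ in current])
print("other:  ", other)
```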


### Assertion Visualization

We can visualize the assertion model results using the `sparknlp_display` library.

In [None]:
from sparknlp_display import AssertionVisualizer

vis = AssertionVisualizer()

vis.display(light_result[0], 'assertion_chunk', 'assertion')

# Mappers and Resolvers


**Drug Action Treatment Mapper**

The pretrained `drug_action_treatment_mapper` model maps drugs to their corresponding `action` and `treatment` through the `ChunkMapperModel()` annotator. The **action** of a drug is its function in various body systems; the **treatment** is the disease the drug is used to treat.

We choose which mapping to return with the `setRels()` parameter of `ChunkMapperModel()`.
Here we reuse the stages built earlier in this notebook that produce `medication_chunk`, feed those chunks to `ChunkMapperModel()`, and set `.setRels()` to `action`.

**Drug Brand Name NDC Mapper**

The `drug_brandname_ndc_mapper` model maps drug brand names to their corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in the result and metadata.

It has one relation type, `Strength_NDC`.

Let's create a pipeline with both mappers and see how it works.

In [18]:
# drug_action_treatment_mapper with "action" mappings; you can also try "treatment"
chunker_at_mapper = medical.ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["medication_chunk"])\
    .setOutputCol("action_mappings")\
    .setRels(["action"])

chunker_ndc_mapper = medical.ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
    .setInputCols(["medication_chunk"])\
    .setOutputCol("ndc")\
    .setRel("Strength_NDC")

mapper_pipeline = nlp.Pipeline(stages=[
    document_assambler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_jsl,
    jsl_converter_internal,
    ner_posology,
    ner_converter_internal,
    text_matcher,
    chunk_merge,
    chunker_at_mapper,
    chunker_ndc_mapper
])


drug_action_treatment_mapper download started this may take some time.
[OK!]
drug_brandname_ndc_mapper download started this may take some time.
[OK!]


## Drug Action Treatment Mapper Results

Let's transform our data on the model and get the results.

In [None]:
text = [["""The patient was prescribed advil. She was seen by the endocrinology service and she was discharged on insulin glargine ."""]]

test_data = spark.createDataFrame(text).toDF("text")

mapper_model = mapper_pipeline.fit(test_data)

result = mapper_model.transform(test_data)

The columns in the result DataFrame:

In [None]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|             ner_jsl|      jsl_drug_chunk|        ner_posology|      posology_chunk|        matched_text|    medication_chunk|     action_mappings|                 ndc|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The patient was p...|[{document, 0, 11...|[{document, 0, 32...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 27, 31, ...|[{named_entity, 0...|[{chun

Chunks detected by the NER models:

In [None]:
result.select(F.explode('medication_chunk.result').alias("chunks")).show(truncate=False)

+----------------+
|chunks          |
+----------------+
|advil           |
|insulin glargine|
+----------------+



Checking mapping results:

In [None]:
result.select("action_mappings.result").show(truncate=False)

+----------------------------------+
|result                            |
+----------------------------------+
|[analgesic, drugs used in diabets]|
+----------------------------------+



Checking mapping metadata:

In [None]:
result.selectExpr("action_mappings.metadata").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

As shown above, the ***metadata*** column lists all the relations for each chunk, where any exist.


Let's display the results in a more readable format.

In [None]:
result.select(F.explode(F.arrays_zip(result.medication_chunk.result,
                                  result.action_mappings.result,
                                  result.action_mappings.metadata)).alias("col"))\
    .select(F.expr("col['0']").alias("medication_chunk"),
            F.expr("col['1']").alias("mapping_result"),
            F.expr("col['2']['all_relations']").alias("all_relations")).show(truncate=False)

+----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|medication_chunk|mapping_result       |all_relations                                                                                                                                                                      |
+----------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|advil           |analgesic            |anti-inflammatory:::antipyretic:::cardiac therapy:::decongestant:::local anesthetic:::nonsteroidal anti-inflammatory:::pain reliever:::topical products for joint and muscular pain|
|insulin glargine|drugs used in diabets|hypoglycemic                                                                
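
The `all_relations` value is a single string with `:::` between entries. If you need the individual actions, a small helper can split it. The sketch below uses the `advil` row from the output above; `parse_all_relations` is a hypothetical helper, and it assumes that `all_relations` lists the mappings other than the primary result, as the output suggests.

```python
def parse_all_relations(primary, all_relations, sep=":::"):
    """Return the primary mapping followed by every additional mapping."""
    extras = [part.strip() for part in all_relations.split(sep) if part.strip()]
    return [primary] + extras

# Values copied from the "advil" row above.
advil_actions = parse_all_relations(
    "analgesic",
    "anti-inflammatory:::antipyretic:::cardiac therapy:::decongestant:::"
    "local anesthetic:::nonsteroidal anti-inflammatory:::pain reliever:::"
    "topical products for joint and muscular pain",
)

for action in advil_actions:
    print(action)
```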

## Drug Brand Name NDC Mapper Results

In [None]:
result.select(F.explode(F.arrays_zip(result.medication_chunk.result,
                                  result.ndc.result,
                                  result.ndc.metadata)).alias("col"))\
    .select(F.expr("col['0']").alias("Brand_Name"),
            F.expr("col['1']").alias("Strength_NDC"),
            F.expr("col['2']['entity']").alias("entity")).filter("entity=='DRUG'").show(truncate=False)

+----------------+-----------------------+------+
|Brand_Name      |Strength_NDC           |entity|
+----------------+-----------------------+------+
|advil           |200 mg/1 | 0573-0166   |DRUG  |
|insulin glargine|100 [iU]/mL | 49502-394|DRUG  |
+----------------+-----------------------+------+



As you can observe, there are corresponding "NDC" mappings for each "brand name".
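
Each `Strength_NDC` value packs the strength and the product NDC into one string separated by `|`. A minimal sketch for splitting them, assuming every value follows the `strength | NDC` layout seen in the output above (`parse_strength_ndc` is a hypothetical helper):

```python
def parse_strength_ndc(value):
    """Split 'strength | NDC' into its two parts; NDC is '' if no '|' is present."""
    strength, _, ndc = value.partition("|")
    return {"strength": strength.strip(), "ndc": ndc.strip()}

# Values copied from the output above.
advil = parse_strength_ndc("200 mg/1 | 0573-0166")
glargine = parse_strength_ndc("100 [iU]/mL | 49502-394")

print(advil)
print(glargine)
```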

## RxNorm Resolver & NDC Mapper

In this section, we use `medication_greedy_chunk` to retrieve the `rxnorm_code` of each drug.
RxNorm is a standardized vocabulary for clinical drugs. It provides a code for each clinical drug, defined by the combination of its active ingredients, strength, and dose form.


### NDC Mapper

In the same pipeline with RxNorm resolver we can use pretrained `rxnorm_ndc_mapper` model that maps RxNorm and RxNorm Extension codes with corresponding National Drug Codes (NDC).

It has two relation types that can be selected with the `setRels()` parameter: **Product NDC** and **Package NDC**.

In [9]:
chunk2doc = nlp.Chunk2Doc()\
  .setInputCols("medication_greedy_chunk")\
  .setOutputCol("ner_chunk_doc")

sbert_embedder = nlp.BertSentenceEmbeddings\
    .pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sentence_embeddings")\
    .setCaseSensitive(False)

rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

resolver2chunk = medical.Resolution2Chunk()\
    .setInputCols(["rxnorm_code"])\
    .setOutputCol("rxnorm2chunk")

chunker_mapper_ndc = medical.ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models")\
    .setInputCols(["rxnorm2chunk"])\
    .setOutputCol("ndc_mappings")\
    .setRels(["Product NDC"])

rxnorm_ndc_pipeline = nlp.Pipeline(
    stages = [
        document_assambler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_jsl_greedy,
        ner_jsl_greedy_converter_internal,
        ner_posology_greedy,
        posology_converter_internal,
        drugs_large,
        drugs_large_converter,
        text_matcher,
        chunk_merge_greedy,
        chunk2doc,
        sbert_embedder,
        rxnorm_resolver,
        resolver2chunk,
        chunker_mapper_ndc])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_rxnorm_augmented download started this may take some time.
[OK!]
rxnorm_ndc_mapper download started this may take some time.
[OK!]
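
The resolver above is configured with `setDistanceFunction("EUCLIDEAN")`. As a toy illustration of what ranking candidates by Euclidean distance means, the sketch below compares a query vector against a few candidate vectors. This is not the library's implementation: the three-dimensional vectors and the `code_A`/`code_B`/`code_C` names are invented for the example, and real sentence embeddings have far more dimensions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical candidate "embeddings" keyed by made-up codes.
candidates = {
    "code_A": [1.0, 0.0, 0.0],
    "code_B": [0.0, 1.0, 0.0],
    "code_C": [0.8, 0.3, 0.0],
}
query = [1.0, 0.1, 0.0]  # hypothetical embedding of the input chunk

# Rank candidate codes from nearest to farthest.
ranked = sorted(candidates, key=lambda code: euclidean(query, candidates[code]))

for code in ranked:
    print(code, round(euclidean(query, candidates[code]), 4))
```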


In [10]:
empty_data = spark.createDataFrame([[""]]).toDF("text")
rxnorm_model = rxnorm_ndc_pipeline.fit(empty_data)

In [11]:
text = """The patient was prescribed Albuterol inhaler when needed . She was seen by the endocrinology service and she was discharged on Avandia 4 mg at nights ,
Coumadin 5 mg with meals , and Metformin 1000 mg two times a day and with a daily dose of Lisinopril 10 mg."""

Let's check the results with the `LightPipeline`.

In [12]:
rxnorm_lp = nlp.LightPipeline(rxnorm_model)

In [13]:
light_result = rxnorm_lp.fullAnnotate(text)

In [14]:
light_result[0].keys()

dict_keys(['ner_posology_greedy', 'ner_jsl_greedy', 'document', 'ndc_mappings', 'matched_text', 'drugs_large', 'sentence_embeddings', 'jsl_greedy_drug_chunk', 'drugs_large_chunk', 'posology_greedy_chunk', 'token', 'rxnorm_code', 'medication_greedy_chunk', 'embeddings', 'ner_chunk_doc', 'rxnorm2chunk', 'sentence'])

In [15]:
# Dictionary from the lists using list comprehensions
res = {
    'chunks': [chunk.result for chunk in light_result[0]['medication_greedy_chunk']],
    'begin': [chunk.begin for chunk in light_result[0]['medication_greedy_chunk']],
    'end': [chunk.end for chunk in light_result[0]['medication_greedy_chunk']],
    'code': [code.result for code in light_result[0]['rxnorm_code']],
    'resolution': [code.metadata['resolved_text'].split(':::') for code in light_result[0]['rxnorm_code']],
    'all_codes': [code.metadata['all_k_results'].split(':::') for code in light_result[0]['rxnorm_code']],
    'all_resolutions': [code.metadata['all_k_resolutions'].split(':::') for code in light_result[0]['rxnorm_code']],
    'all_distances': [code.metadata['all_k_distances'].split(':::') for code in light_result[0]['rxnorm_code']],
    'all_cosines': [code.metadata['all_k_cosine_distances'].split(':::') for code in light_result[0]['rxnorm_code']]
}

df = pd.DataFrame(res).replace('NONE', '-')

df

Unnamed: 0,chunks,begin,end,code,resolution,all_codes,all_resolutions,all_distances,all_cosines
0,Albuterol inhaler,27,43,745678,[albuterol metered dose inhaler [albuterol met...,"[745678, 2108226, 1154602, 2108233, 2108228, 1...",[albuterol metered dose inhaler [albuterol met...,"[4.9847, 5.1028, 5.4746, 5.7809, 6.2859, 6.394...","[0.0414, 0.0439, 0.0505, 0.0562, 0.0676, 0.068..."
1,Avandia 4 mg,127,138,261242,[rosiglitazone 4 MG Oral Tablet [Avandia]],"[261242, 810073, 153845, 1094008, 2123140, 136...","[rosiglitazone 4 MG Oral Tablet [Avandia], fes...","[0.0000, 4.7482, 5.0125, 5.2516, 5.4650, 5.488...","[0.0000, 0.0365, 0.0409, 0.0453, 0.0492, 0.049..."
2,Coumadin 5 mg,152,164,855333,[warfarin sodium 5 MG [Coumadin]],"[855333, 438740, 153692, 352120, 1036890, 1043...","[warfarin sodium 5 MG [Coumadin], coumarin 5 m...","[0.0000, 4.0885, 5.3065, 5.5132, 5.5336, 5.741...","[0.0000, 0.0287, 0.0479, 0.0518, 0.0525, 0.057..."
3,Metformin 1000 mg,183,199,316255,[metformin 1000 mg [metformin 1000 mg]],"[316255, 860999, 860997, 861014, 861004, 86100...","[metformin 1000 mg [metformin 1000 mg], metfor...","[0.0000, 5.2988, 5.9071, 6.3066, 6.5777, 6.662...","[0.0000, 0.0445, 0.0553, 0.0632, 0.0679, 0.070..."
4,Lisinopril 10 mg,242,257,314076,[lisinopril 10 MG Oral Tablet],"[314076, 567576, 565846, 389184, 563611, 32829...","[lisinopril 10 MG Oral Tablet, lisinopril 10 m...","[0.0000, 3.6543, 4.2783, 4.2805, 4.6016, 5.126...","[0.0000, 0.0234, 0.0325, 0.0315, 0.0363, 0.046..."
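
Because the resolver returns several candidates per chunk, you may want to keep only those within some distance of the query. The sketch below pairs codes with distances and applies a cut-off. The strings are an excerpt (first three candidates) of the `Avandia 4 mg` row above, written in the `:::`-separated form that the metadata uses; the `4.8` threshold and the `candidates_within` helper are arbitrary choices for illustration.

```python
# Excerpt of the "Avandia 4 mg" row above (first three candidates only).
all_k_results = "261242:::810073:::153845"
all_k_distances = "0.0000:::4.7482:::5.0125"

def candidates_within(codes, distances, max_distance, sep=":::"):
    """Pair each candidate code with its distance and keep the close ones."""
    pairs = zip(codes.split(sep), (float(d) for d in distances.split(sep)))
    return [(code, dist) for code, dist in pairs if dist <= max_distance]

close = candidates_within(all_k_results, all_k_distances, max_distance=4.8)
print(close)
```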


Let's check the results of NDC Mapper.

In [19]:
cols = ['medication_greedy_chunk','rxnorm_code','ndc_mappings']
res = {col : [a.result for a in light_result[0][col]] for col in cols}

pd.DataFrame(res).replace('NONE', '-')

Unnamed: 0,medication_greedy_chunk,rxnorm_code,ndc_mappings
0,Albuterol inhaler,745678,-
1,Avandia 4 mg,261242,00173-0835
2,Coumadin 5 mg,855333,-
3,Metformin 1000 mg,316255,-
4,Lisinopril 10 mg,314076,00093-1113
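
Not every RxNorm code has an NDC mapping; the mapper returns the string `NONE` in that case, which the cell above displays as `-`. A quick sketch for listing the unmapped chunks and measuring coverage, using the rows from the table above (plain Python, written only for illustration):

```python
# (chunk, rxnorm_code, ndc) rows from the table above, with "-" restored to "NONE".
mapping_rows = [
    ("Albuterol inhaler", "745678", "NONE"),
    ("Avandia 4 mg", "261242", "00173-0835"),
    ("Coumadin 5 mg", "855333", "NONE"),
    ("Metformin 1000 mg", "316255", "NONE"),
    ("Lisinopril 10 mg", "314076", "00093-1113"),
]

# Chunks whose RxNorm code could not be mapped to an NDC.
unmapped = [chunk for chunk, _, ndc in mapping_rows if ndc == "NONE"]

# Share of chunks that did receive an NDC.
mapped_count = len(mapping_rows) - len(unmapped)
coverage = mapped_count / len(mapping_rows)

print("unmapped:", unmapped)
print("coverage:", coverage)
```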


We can visualize the resolver results using the `sparknlp_display` library.

In [20]:
from sparknlp_display import EntityResolverVisualizer

visualiser = EntityResolverVisualizer()

# Change color of an entity label
visualiser.set_label_colors({'DRUG':'#008080'})

visualiser.display(light_result[0], 'medication_greedy_chunk', 'rxnorm_code')

# Creating Medication Pretrained Pipeline

This section demonstrates the process of creating a pretrained pipeline, saving it locally, and subsequently loading it from the local storage.

We can combine the greedy NER, assertion, and resolver models that have already been constructed to create a pretrained pipeline.

In [None]:
medication_pipeline = nlp.Pipeline(stages=[
    document_assambler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_jsl_greedy,
    ner_jsl_greedy_converter_internal,
    ner_posology_greedy,
    posology_converter_internal,
    drugs_large,
    drugs_large_converter,
    text_matcher,
    chunk_merge_greedy,
    chunk_merge_assertion,
    clinical_assertion,
    assertion_filterer,
    chunk2doc,
    sbert_embedder,
    rxnorm_resolver,
    resolver2chunk,
    chunker_mapper_ndc
    ])

In [None]:
empty_data = spark.createDataFrame([[""]]).toDF("text")

medication_model = medication_pipeline.fit(empty_data)

In [None]:
medication_model.stages

[DocumentAssembler_e55f5da65899,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_33919ff295d4,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_24fd3973d348,
 NER_CONVERTER_00be911133ba,
 MedicalNerModel_419f5f2e48fa,
 NER_CONVERTER_79a11cd396b4,
 MedicalNerModel_f87187b94332,
 NER_CONVERTER_3d167b157b3b,
 ENTITY_EXTRACTOR_991b113451e8,
 MERGE_717403ce6c9c,
 MERGE_3543162021e9,
 ASSERTION_DL_25881ab6309e,
 AssertionFilterer_e1f1488e2626,
 Chunk2Doc_a86603a1c96a,
 BERT_SENTENCE_EMBEDDINGS_0bee53f1b2cc,
 ENTITY_3feb01c2d233,
 Resolution2Chunk_70fcc7572cb6,
 CHUNKER-MAPPER_2d7b0e176787]

### Saving Pretrained Pipeline To Local

After fitting the pipeline, we can save it locally and then use it as a `PretrainedPipeline` whenever needed.

In [None]:
medication_model.write().overwrite().save("medication_pipeline")

### Loading Pretrained Pipeline From Local

In [None]:
from pyspark.ml import PipelineModel

To load the saved pipeline from local storage, use the `PretrainedPipeline.from_disk` method. The loaded pipeline can then be used like a `LightPipeline`.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_medication_pipe = PretrainedPipeline.from_disk('medication_pipeline')

In [None]:
text = """The patient was prescribed Albuterol inhaler when needed . She was seen by the endocrinology service and she was discharged on Avandia 4 mg at nights , Coumadin 5 mg with meals , and Metformin 1000 mg two times a day and with a daily dose of Lisinopril 10 mg."""

In [None]:
result = ner_medication_pipe.fullAnnotate(text)[0]

result.keys()

dict_keys(['assertion_filtered', 'ner_posology_greedy', 'ner_jsl_greedy', 'document', 'ndc_mappings', 'matched_text', 'assertion', 'drugs_large', 'sentence_embeddings', 'jsl_greedy_drug_chunk', 'drugs_large_chunk', 'posology_greedy_chunk', 'token', 'rxnorm_code', 'medication_greedy_chunk', 'embeddings', 'assertion_chunk', 'ner_chunk_doc', 'rxnorm2chunk', 'sentence'])

Let's see the results.

In [None]:
cols = [ 'medication_greedy_chunk','assertion','rxnorm_code','ndc_mappings']
res = {col : [a.result for a in result[col]] for col in cols}
res['entity'] = [a.metadata['entity'] for a in result['medication_greedy_chunk']]

pd.DataFrame(res).replace('NONE', '-')

Unnamed: 0,medication_greedy_chunk,assertion,rxnorm_code,ndc_mappings,entity
0,Albuterol inhaler,conditional,745678,-,DRUG
1,Avandia 4 mg,present,261242,00173-0835,DRUG
2,Coumadin 5 mg,present,855333,-,DRUG
3,Metformin 1000 mg,present,316255,-,DRUG
4,Lisinopril 10 mg,present,314076,00093-1113,DRUG
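
For a real-time application, the annotated result usually has to be turned into plain records that can be serialized as JSON. The sketch below shows one way to do this under stated assumptions: `FakeAnnotation` is a hypothetical stand-in that models only the `result` and `metadata` attributes used in the cell above, `to_records` is a hypothetical helper, and the columns are assumed to be aligned one-to-one, as they are in this notebook's output.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for an annotation object: only `result` and
# `metadata` are modelled, because those are all this sketch reads.
FakeAnnotation = namedtuple("FakeAnnotation", ["result", "metadata"])

def to_records(annotated, columns, entity_column="medication_greedy_chunk"):
    """Turn one fullAnnotate-style dict into a list of per-medication dicts."""
    per_column = [[a.result for a in annotated[col]] for col in columns]
    entities = [a.metadata.get("entity") for a in annotated[entity_column]]
    records = []
    # zip stops at the shortest column, so misaligned columns are truncated.
    for values, entity in zip(zip(*per_column), entities):
        record = dict(zip(columns, values))
        record["entity"] = entity
        records.append(record)
    return records

# One row copied from the table above.
sample = {
    "medication_greedy_chunk": [FakeAnnotation("Avandia 4 mg", {"entity": "DRUG"})],
    "assertion": [FakeAnnotation("present", {})],
    "rxnorm_code": [FakeAnnotation("261242", {})],
    "ndc_mappings": [FakeAnnotation("00173-0835", {})],
}
columns = ["medication_greedy_chunk", "assertion", "rxnorm_code", "ndc_mappings"]

records = to_records(sample, columns)
print(json.dumps(records, indent=2))
```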


### Using Pretrained Pipeline With `.transform` Method

If you prefer to apply the pretrained pipeline to a Spark DataFrame with `.transform` rather than use it as a `LightPipeline`, follow the steps below.

In [None]:
data = spark.createDataFrame([[text]]).toDF("text")

pp_transform_result = ner_medication_pipe.transform(data)

In [None]:
pp_transform_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------+--------------------+---------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|      ner_jsl_greedy|jsl_greedy_drug_chunk| ner_posology_greedy|posology_greedy_chunk|         drugs_large|   drugs_large_chunk|        matched_text|medication_greedy_chunk|     assertion_chunk|           assertion|  assertion_filtered|       ner_chunk_doc| sentence_embeddings|         rxnorm_code|        rxnorm2chunk|        ndc_mappings|
+--------------------+--------------------+--------------------+--------------------+-------------------

Let's see the results.

In [None]:
# Apply transformations and select necessary columns
pp_transform_df = (pp_transform_result.select(
    F.explode(
        F.arrays_zip(
            pp_transform_result.medication_greedy_chunk.result,
            pp_transform_result.medication_greedy_chunk.metadata,
            pp_transform_result.assertion.result,
            pp_transform_result.rxnorm_code.result,
            pp_transform_result.ndc_mappings.result
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("medication_greedy_chunk"),
    F.expr("cols['1']['entity']").alias("entity"),
    F.expr("cols['2']").alias("assertion"),
    F.expr("cols['3']").alias("rxnorm_code"),
    F.expr("cols['4']").alias("ndc_mappings")
).toPandas())

pp_transform_df

Unnamed: 0,medication_greedy_chunk,entity,assertion,rxnorm_code,ndc_mappings
0,Albuterol inhaler,DRUG,conditional,745678,NONE
1,Avandia 4 mg,DRUG,present,261242,00173-0835
2,Coumadin 5 mg,DRUG,present,855333,NONE
3,Metformin 1000 mg,DRUG,present,316255,NONE
4,Lisinopril 10 mg,DRUG,present,314076,00093-1113
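
For intuition, the `arrays_zip` plus `explode` pattern used above can be mimicked in plain Python. This is a minimal sketch, not part of the pipeline: the lists are copied from the table above, and `itertools.zip_longest` is used as a stand-in because Spark's `arrays_zip` pads shorter arrays with nulls rather than truncating them.

```python
from itertools import zip_longest

# Parallel annotation arrays for a single document (values copied from the table above).
chunks = ["Albuterol inhaler", "Avandia 4 mg", "Coumadin 5 mg", "Metformin 1000 mg", "Lisinopril 10 mg"]
assertions = ["conditional", "present", "present", "present", "present"]
rxnorm_codes = ["745678", "261242", "855333", "316255", "314076"]
ndc_mappings = ["NONE", "00173-0835", "NONE", "NONE", "00093-1113"]

# "arrays_zip": combine the i-th element of every array into one struct.
# "explode": emit one output row per struct.
rows = [
    {
        "medication_greedy_chunk": chunk,
        "assertion": assertion,
        "rxnorm_code": rxnorm,
        "ndc_mappings": ndc,
    }
    for chunk, assertion, rxnorm, ndc in zip_longest(
        chunks, assertions, rxnorm_codes, ndc_mappings, fillvalue=None
    )
]

for row in rows:
    print(row)
```

Because the pairing is purely positional, the rows are only meaningful when each array holds one entry per chunk, in the same order.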


# Pretrained Pipelines


This section lists the pretrained pipelines available for medication extraction and shows how to load them from the cloud and use them with a single line of code.



|index|model|index|model|
|-----:|:-----|-----:|:-----|
| 1| [ner_medication_pipeline](https://nlp.johnsnowlabs.com/2024/03/22/ner_medication_pipeline_en.html)  | 2| [ner_medication_generic_pipeline](https://nlp.johnsnowlabs.com/2021/03/31/ner_medication_generic_pipeline.html)  |
| 3| [medication_resolver_pipeline](https://nlp.johnsnowlabs.com/2024/03/20/medication_resolver_pipeline_en.html)  | 4| [medication_resolver_transform_pipeline](https://nlp.johnsnowlabs.com/2024/03/20/medication_resolver_transform_pipeline_en.html)  |






Let's load `medication_resolver_pipeline`.

In [None]:
from sparknlp.pretrained import PretrainedPipeline

ner_pipeline = PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models")

medication_resolver_pipeline download started this may take some time.
Approx size to download 3.2 GB
[OK!]


In [None]:
ner_pipeline.model.stages

[DocumentAssembler_f764e8967508,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_ae5dec356289,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_419f5f2e48fa,
 NER_CONVERTER_6977a7eabdd4,
 ENTITY_EXTRACTOR_991b113451e8,
 MERGE_d1bbb6781465,
 CHUNKER-MAPPER_7316b33a7307,
 CHUNKER-MAPPER_f4e3b6461abe,
 ChunkMapperFilterer_b60d3450ff3d,
 Chunk2Doc_653a064a6a42,
 BERT_SENTENCE_EMBEDDINGS_0bee53f1b2cc,
 ENTITY_3feb01c2d233,
 ResolverMerger_be3e19c4d171,
 CHUNKER-MAPPER_9db81cd8a096,
 CHUNKER-MAPPER_226ea5b0032a,
 CHUNKER-MAPPER_23ebe7abd111,
 CHUNKER-MAPPER_2d7b0e176787,
 CHUNKER-MAPPER_2d7b0e176787,
 CHUNKER-MAPPER_754c2579b746,
 Finisher_8b27ac12ffcc]

After loading the `PretrainedPipeline` from the cloud, you can use it like a `LightPipeline`.

In [None]:
text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera.The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet."""

result = ner_pipeline.fullAnnotate(text)[0]

result.keys()

dict_keys(['ner_chunk', 'NDC_Package', 'SNOMED_CT', 'RxNorm_Chunk', 'UMLS', 'Treatment', 'NDC_Product', 'ADE', 'Action', 'sentence'])

Let's see the results.

In [None]:
# Output columns to collect from the fullAnnotate result
cols = ['ner_chunk', 'ADE', 'RxNorm_Chunk', 'Action', 'Treatment', 'UMLS', 'SNOMED_CT', 'NDC_Package', 'NDC_Product']

# One list of result strings per column
res = {col: [a.result for a in result[col]] for col in cols}

# The entity label is stored in the metadata of each NER chunk
res['entity'] = [a.metadata['entity'] for a in result['ner_chunk']]

# Display unmapped values ('NONE') as '-'
pd.DataFrame(res).replace('NONE', '-')

Unnamed: 0,ner_chunk,ADE,RxNorm_Chunk,Action,Treatment,UMLS,SNOMED_CT,NDC_Package,NDC_Product,entity
0,Amlodopine Vallarta 10-320mg,Gynaecomastia,722131,-,-,C1949334,425838008,00093-7693-56,00093-7693,DRUG
1,Eviplera,Anxiety,217010,Inhibitory Bone Resorption,Osteoporosis,C0720318,-,-,-,DRUG
2,Lescol 40 MG,-,103919,Hypocholesterolemic,Heterozygous Familial Hypercholesterolemia,C0353573,-,00078-0234-05,00078-0234,DRUG
3,Everolimus 1.5 mg tablet,Acute myocardial infarction,2056895,-,-,C4723581,-,00054-0604-21,00054-0604,DRUG
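
The dictionary comprehension above relies on every output column holding exactly one annotation per `ner_chunk`; `pd.DataFrame` raises a `ValueError` when the lists differ in length. A small guard can make such a mismatch easier to diagnose. The sketch below uses a hypothetical helper, `check_aligned`, which is not part of the library, and a subset of the values from the table above.

```python
def check_aligned(columns):
    """Return `columns` unchanged if all lists have equal length, else raise ValueError."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Report the per-column lengths so the offending column is easy to spot.
        raise ValueError(f"Columns are not aligned: {lengths}")
    return columns


# Subset of the columns shown in the table above.
res = {
    "ner_chunk": ["Amlodopine Vallarta 10-320mg", "Eviplera", "Lescol 40 MG", "Everolimus 1.5 mg tablet"],
    "RxNorm_Chunk": ["722131", "217010", "103919", "2056895"],
    "UMLS": ["C1949334", "C0720318", "C0353573", "C4723581"],
}

check_aligned(res)  # passes silently: every column has four entries
```

In the notebook, the call would sit between building `res` and passing it to `pd.DataFrame`.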
