![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/23.Drug_Normalizer.ipynb)

# 23.Clinical Drug Normalizer

### New Annotator that transforms text to the format used in the RxNorm and SNOMED standards

It takes annotated documents of type Array\[AnnotatorType\](DOCUMENT) as input and returns annotated documents of type AnnotatorType.DOCUMENT.

Parameters are:
- inputCol: input column name string which targets a column of type Array(AnnotatorType.DOCUMENT).
- outputCol: output column name string which targets a column of type AnnotatorType.DOCUMENT.
- lowercase: whether to convert strings to lowercase. Default is False.
- policy: rule to remove patterns from text.  Valid policy values are:  
  + **"all"**,   
  + **"abbreviations"**,   
  + **"dosages"**
   
The default is "all". The "abbreviations" policy expands common drug abbreviations; the "dosages" policy converts drug dosages and values to a standard form (see the examples below).
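How the three policy values relate can be sketched in plain Python. This is only an illustration, assuming (as the outputs later in this notebook suggest) that "all" applies both the abbreviation and the dosage rules. The function names and the two-entry abbreviation dictionary are hypothetical; they are not the annotator's real rule set.

```python
import re

# Hypothetical toy rules taken from the examples in this notebook.
# The real DrugNormalizer rules are far more extensive.
ABBREVIATIONS = {"injec": "injection", "sol": "solution"}
VALID_POLICIES = {"all", "abbreviations", "dosages"}


def expand_abbreviations(text: str) -> str:
    """Replace whole tokens found in the toy abbreviation dictionary."""
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split(" "))


def space_dosage_forms(text: str) -> str:
    """Insert a space between a digit and a letter that directly follows it."""
    return re.sub(r"(\d)([A-Za-z])", r"\1 \2", text)


def normalize(text: str, policy: str = "all") -> str:
    """Apply the toy rule families selected by `policy`."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"Unknown policy: {policy!r}")
    if policy in ("all", "abbreviations"):
        text = expand_abbreviations(text)
    if policy in ("all", "dosages"):
        text = space_dosage_forms(text)
    return text


sample = "aspirin 13bag oral sol"
print(normalize(sample, "abbreviations"))  # aspirin 13bag oral solution
print(normalize(sample, "dosages"))        # aspirin 13 bag oral sol
print(normalize(sample, "all"))            # aspirin 13 bag oral solution
```

The dispatch keeps the two rule families independent, which mirrors how each policy in this notebook changes only its own part of the text.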

#### Examples of transformation:
    
1) "Sodium Chloride/Potassium Chloride 13bag"  >>>  "Sodium Chloride / Potassium Chloride **13 bag**" : add spaces around the separator and between the quantity and the dosage form

2) "interferon alfa-2b 10 million unit ( 1 ml ) injec" >>> "interferon alfa - 2b 10000000 unt ( 1 ml ) injection" : convert **10 million unit** to **10000000 unt**, replace **injec** with **injection**

3) "aspirin 10 meq/ 5 ml oral sol" >>> "aspirin 2 meq/ml oral solution" : normalize **10 meq/ 5 ml** to **2 meq/ml**, expand the abbreviation **oral sol** to **oral solution**

4) "adalimumab 54.5 + 43.2 gm" >>> "adalimumab 97700 mg" : sum **54.5 + 43.2** and convert **gm** to **mg**

5) "Agnogenic one  half cup" >>> "Agnogenic 0.5 oral solution" : replace **one  half** with **0.5**, normalize **cup** to **oral solution**
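The arithmetic behind examples 2, 3 and 4 can be checked with a few lines of plain Python. This is a sketch of the calculations only, not the annotator's implementation; the helper names are hypothetical, and `Decimal` is used here simply to avoid floating-point rounding noise.

```python
from decimal import Decimal


def _fmt(value: Decimal) -> str:
    """Render a Decimal without a trailing '.0' when it is a whole number."""
    return str(int(value)) if value == value.to_integral_value() else str(value)


def concentration(amount: str, amount_unit: str, volume: str, volume_unit: str) -> str:
    """Reduce 'amount per volume' to a per-single-unit concentration."""
    per_unit = Decimal(amount) / Decimal(volume)
    return f"{_fmt(per_unit)} {amount_unit}/{volume_unit}"


def grams_to_mg(*grams: str) -> str:
    """Sum several gram quantities and express the total in milligrams."""
    total_mg = sum(Decimal(g) for g in grams) * 1000
    return f"{_fmt(total_mg)} mg"


def scaled_units(count: str, multiplier: int) -> str:
    """Expand a worded multiplier such as 'million' into a plain number of units."""
    return f"{_fmt(Decimal(count) * multiplier)} unt"


print(scaled_units("10", 1_000_000))         # 10000000 unt
print(concentration("10", "meq", "5", "ml"))  # 2 meq/ml
print(grams_to_mg("54.5", "43.2"))            # 97700 mg
```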

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# Colab Setup

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [None]:
import json
import os

from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

import sparknlp_jsl
import sparknlp

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.1.3
Spark NLP_JSL Version : 6.1.1


In [None]:
# Alternative: start the session manually with the same custom params as the start() call above

def start(secret):
    # Reuse the versions of the packages installed in this environment
    version = sparknlp.version()
    jsl_version = sparknlp_jsl.version()

    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "2000M") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+version)  \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+secret+"/spark-nlp-jsl-"+jsl_version+".jar")

    return builder.getOrCreate()

# spark = start(license_keys['SECRET'])

In [None]:
# Sample data
data_to_normalize = spark.createDataFrame([
            ("A", "Sodium Chloride/Potassium Chloride 13bag", "Sodium Chloride / Potassium Chloride 13 bag"),
            ("B", "interferon alfa-2b 10 million unit ( 1 ml ) injec", "interferon alfa - 2b 10000000 unt ( 1 ml ) injection"),
            ("C", "aspirin 10 meq/ 5 ml oral sol", "aspirin 2 meq/ml oral solution")
        ]).toDF("cuid", "text", "target_normalized_text")

data_to_normalize.show(truncate=100)

+----+-------------------------------------------------+----------------------------------------------------+
|cuid|                                             text|                              target_normalized_text|
+----+-------------------------------------------------+----------------------------------------------------+
|   A|         Sodium Chloride/Potassium Chloride 13bag|         Sodium Chloride / Potassium Chloride 13 bag|
|   B|interferon alfa-2b 10 million unit ( 1 ml ) injec|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|   C|                    aspirin 10 meq/ 5 ml oral sol|                      aspirin 2 meq/ml oral solution|
+----+-------------------------------------------------+----------------------------------------------------+



In [None]:
# Annotator that transforms a text column from dataframe into normalized text (with all policy)

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

policy = "all"

drug_normalizer = DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized") \
    .setPolicy(policy)

drug_normalizer_pipeline = Pipeline(stages=[
    document_assembler,
    drug_normalizer
    ])

ds = drug_normalizer_pipeline.fit(data_to_normalize).transform(data_to_normalize)

ds = ds.selectExpr("document", "target_normalized_text", "explode(document_normalized.result) as all_normalized_text")
ds.show(truncate = False)

+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|document                                                                                   |target_normalized_text                              |all_normalized_text                                 |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|[{document, 0, 39, Sodium Chloride/Potassium Chloride 13bag, {sentence -> 0}, []}]         |Sodium Chloride / Potassium Chloride 13 bag         |Sodium Chloride / Potassium Chloride 13 bag         |
|[{document, 0, 48, interferon alfa-2b 10 million unit ( 1 ml ) injec, {sentence -> 0}, []}]|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
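Each entry in the `document` column above is an annotation with an annotator type, begin and end character offsets, the result text, a metadata map and an embeddings array. The plain-Python sketch below mirrors that layout to show that `end` is the inclusive index of the last character. The class and helper are hypothetical stand-ins for illustration, not Spark NLP's own classes.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-in mirroring the fields printed in the output above."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)
    embeddings: list = field(default_factory=list)


def document_annotation(text: str) -> Annotation:
    """Build a whole-text document annotation; `end` is inclusive."""
    return Annotation("document", 0, len(text) - 1, text, {"sentence": "0"})


ann = document_annotation("Sodium Chloride/Potassium Chloride 13bag")
print(ann.begin, ann.end)  # 0 39
```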


In [None]:
# Annotator that transforms a text column from dataframe into normalized text (with abbreviations only policy)

policy = "abbreviations"

drug_normalizer_abb = DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized_abbreviations") \
    .setPolicy(policy)

ds = drug_normalizer_abb.transform(ds)

ds = ds.selectExpr("document", "target_normalized_text", "all_normalized_text", "explode(document_normalized_abbreviations.result) as abbr_normalized_text")
ds.select("target_normalized_text", "all_normalized_text", "abbr_normalized_text").show(truncate=1000)

+----------------------------------------------------+----------------------------------------------------+-----------------------------------------------------+
|                              target_normalized_text|                                 all_normalized_text|                                 abbr_normalized_text|
+----------------------------------------------------+----------------------------------------------------+-----------------------------------------------------+
|         Sodium Chloride / Potassium Chloride 13 bag|         Sodium Chloride / Potassium Chloride 13 bag|             Sodium Chloride/Potassium Chloride 13bag|
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa-2b 10 million unit ( 1 ml ) injection|
|                      aspirin 2 meq/ml oral solution|                      aspirin 2 meq/ml oral solution|                   aspirin 10 meq/ 5 ml oral solution|
+---------------------------

In [None]:
# Transform a text column from dataframe into normalized text (with dosages only policy)

policy = "dosages"

drug_normalizer_dos = DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized_dosages") \
    .setPolicy(policy)

ds = drug_normalizer_dos.transform(ds)

ds.selectExpr("target_normalized_text", "all_normalized_text", "explode(document_normalized_dosages.result) as dos_normalized_text").show(truncate=1000)

+----------------------------------------------------+----------------------------------------------------+------------------------------------------------+
|                              target_normalized_text|                                 all_normalized_text|                             dos_normalized_text|
+----------------------------------------------------+----------------------------------------------------+------------------------------------------------+
|         Sodium Chloride / Potassium Chloride 13 bag|         Sodium Chloride / Potassium Chloride 13 bag|     Sodium Chloride / Potassium Chloride 13 bag|
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injec|
|                      aspirin 2 meq/ml oral solution|                      aspirin 2 meq/ml oral solution|                       aspirin 2 meq/ml oral sol|
+----------------------------------------------------+----

#### Apply normalizer only on NER chunks

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator: splits each document into sentences
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .addSplitChars(";")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Extract entities with NER model posology
posology_ner = MedicalNerModel.pretrained("ner_posology_large", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_posology")

# Group extracted entities into chunks
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_posology"])\
    .setOutputCol("ner_chunk_posology")

# Convert extracted chunks to documents, keeping the chunk information in metadata
c2doc = Chunk2Doc()\
    .setInputCols("ner_chunk_posology")\
    .setOutputCol("chunk_doc")

# Transform a chunk document into normalized text
drug_normalizer = DrugNormalizer() \
    .setInputCols("chunk_doc") \
    .setOutputCol("document_normalized_dosages")\
    .setPolicy("all")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    c2doc,
    drug_normalizer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_large download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


In [None]:
! wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed_sample_text_small.csv

In [None]:
import pyspark.sql.functions as F

pubMedDF = spark.read\
                .option("header", "true")\
                .csv("pubmed_sample_text_small.csv")

pubMedDF.show(truncate=50)

+--------------------------------------------------+
|                                              text|
+--------------------------------------------------+
|The human KCNJ9 (Kir 3.3, GIRK3) is a member of...|
|BACKGROUND: At present, it is one of the most i...|
|OBJECTIVE: To investigate the relationship betw...|
|Combined EEG/fMRI recording has been used to lo...|
|Kohlschutter syndrome is a rare neurodegenerati...|
|Statistical analysis of neuroimages is commonly...|
|The synthetic DOX-LNA conjugate was characteriz...|
|Our objective was to compare three different me...|
|We conducted a phase II study to assess the eff...|
|"""Monomeric sarcosine oxidase (MSOX) is a flav...|
|We presented the tachinid fly Exorista japonica...|
|The literature dealing with the water conductin...|
|A novel approach to synthesize chitosan-O-isopr...|
|An HPLC-ESI-MS-MS method has been developed for...|
|The localizing and lateralizing values of eye a...|
|OBJECTIVE: To evaluate the effectiveness and 

In [None]:
result = model.transform(pubMedDF.limit(100))

In [None]:
result.show(2)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------------+
|                text|            document|            sentence|               token|          embeddings|        ner_posology|  ner_chunk_posology|           chunk_doc|document_normalized_dosages|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------------+
|The human KCNJ9 (...|[{document, 0, 95...|[{document, 0, 12...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 52, 122,...|[{document, 52, 1...|       [{document, 52, 7...|
|BACKGROUND: At pr...|[{document, 0, 14...|[{document, 0, 19...|[{token, 0, 9, BA...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 167, 180...|[{document, 167, ...|       [{document, 167, ...|
+---------

In [None]:
import pyspark.sql.functions as F
result.select(F.explode('document_normalized_dosages.result')).show(truncate=100)

+-----------------------------------------------------------------------------+
|                                                                          col|
+-----------------------------------------------------------------------------+
|G - protein - activated inwardly rectifying potassium ( GIRK ) channel family|
|                                                                8 base - pair|
|                                                               anthracyclines|
|                                                                      taxanes|
|                                                                 usefulnessof|
|                                                                  vinorelbine|
|                                                                  vinorelbine|
|                                                               anthracyclines|
|                                                                      taxanes|
|                                       