![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/14.0.Drug_Normalizer.ipynb)

# Clinical Drug Normalizer

### New Annotator that transforms text to the format used in the RxNorm and SNOMED standards

It takes annotated documents of type Array\[AnnotatorType\](DOCUMENT) as input and outputs annotated documents of type AnnotatorType.DOCUMENT.

Parameters are:
- inputCol: input column name string which targets a column of type Array(AnnotatorType.DOCUMENT).
- outputCol: output column name string which targets a column of type AnnotatorType.DOCUMENT.
- lowercase: whether to convert strings to lowercase. Default is False.
- policy: rule to remove patterns from text.  Valid policy values are:  
  + **"all"**,   
  + **"abbreviations"**,   
  + **"dosages"**
   
The default is "all". The "abbreviations" policy expands common drug abbreviations; the "dosages" policy converts drug dosages and values to the standard form (see the examples below).
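
To make the idea behind the "abbreviations" policy concrete, here is a minimal pure-Python sketch. It is not the annotator's implementation: the `ABBREVIATIONS` table and the `expand_abbreviations` helper are hypothetical and cover only two abbreviations that appear in this notebook's sample data.

```python
import re

# Hypothetical lookup table, for illustration only. The real annotator's
# dictionary is internal to the library and is assumed to be much larger.
ABBREVIATIONS = {
    "injec": "injection",
    "sol": "solution",
}

# Match any known abbreviation, but only as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b"
)

def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations with their expanded form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("aspirin 10 meq/ 5 ml oral sol"))
# aspirin 10 meq/ 5 ml oral solution
print(expand_abbreviations("interferon alfa-2b 10 million unit ( 1 ml ) injec"))
# interferon alfa-2b 10 million unit ( 1 ml ) injection
```

The word-boundary anchors keep already expanded words such as "solution" or "injection" from being rewritten a second time.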

#### Examples of transformation:
    
1) "Sodium Chloride/Potassium Chloride 13bag"  >>>  "Sodium Chloride / Potassium Chloride **13 bag**" : add spaces around the slash and inside the form entity

2) "interferon alfa-2b 10 million unit ( 1 ml ) injec" >>> "interferon alfa - 2b 10000000 unt ( 1 ml ) injection" : convert **10 million unit** to **10000000 unt**, replace **injec** with **injection**

3) "aspirin 10 meq/ 5 ml oral sol" >>> "aspirin 2 meq/ml oral solution" : normalize **10 meq/ 5 ml** to **2 meq/ml**, expand the abbreviation **oral sol** to **oral solution**

4) "adalimumab 54.5 + 43.2 gm" >>> "adalimumab 97700 mg" : combine **54.5 + 43.2** and normalize **gm** to **mg**

5) "Agnogenic one  half cup" >>> "Agnogenic 0.5 oral solution" : replace **one  half** with **0.5**, normalize **cup** to **oral solution**
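
The arithmetic behind examples 2 to 5 can be checked by hand. The sketch below is only an illustration of those calculations, not the annotator's code; all helper names (`fmt`, `concentration`, `grams_to_milligrams`, `millions_to_units`) are hypothetical. It uses `Decimal` so that sums such as 54.5 + 43.2 are not affected by binary floating-point rounding.

```python
from decimal import Decimal

def fmt(value: Decimal) -> str:
    """Render a Decimal without a trailing '.0' or an exponent."""
    if value == value.to_integral_value():
        return str(int(value))
    return format(value.normalize(), "f")

def concentration(amount: str, volume: str) -> str:
    """Reduce 'amount per volume' to an amount per single unit of volume."""
    return fmt(Decimal(amount) / Decimal(volume))

def grams_to_milligrams(*grams: str) -> str:
    """Sum the gram quantities, then convert the total to milligrams."""
    total = sum((Decimal(g) for g in grams), Decimal(0))
    return fmt(total * 1000)

def millions_to_units(millions: str) -> str:
    """Write 'N million' out as a plain integer count."""
    return fmt(Decimal(millions) * 1_000_000)

print(millions_to_units("10"), "unt")             # 10000000 unt
print(concentration("10", "5"), "meq/ml")         # 2 meq/ml
print(grams_to_milligrams("54.5", "43.2"), "mg")  # 97700 mg
print(concentration("1", "2"))                    # 0.5 (one half)
```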

# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

# Start Spark Session

In [4]:
from johnsnowlabs import nlp, medical, visual

# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start()

👌 Detected license file /content/spark_jsl.json
👌 Launched cpu optimized session with: 🚀Spark-NLP==5.0.0, 💊Spark-Healthcare==5.0.0, running on ⚡ PySpark==3.1.2


In [5]:
# Sample data
data_to_normalize = spark.createDataFrame([
            ("A", "Sodium Chloride/Potassium Chloride 13bag", "Sodium Chloride / Potassium Chloride 13 bag"),
            ("B", "interferon alfa-2b 10 million unit ( 1 ml ) injec", "interferon alfa - 2b 10000000 unt ( 1 ml ) injection"),
            ("C", "aspirin 10 meq/ 5 ml oral sol", "aspirin 2 meq/ml oral solution")
        ]).toDF("cuid", "text", "target_normalized_text")

data_to_normalize.show(truncate=100)

+----+-------------------------------------------------+----------------------------------------------------+
|cuid|                                             text|                              target_normalized_text|
+----+-------------------------------------------------+----------------------------------------------------+
|   A|         Sodium Chloride/Potassium Chloride 13bag|         Sodium Chloride / Potassium Chloride 13 bag|
|   B|interferon alfa-2b 10 million unit ( 1 ml ) injec|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|   C|                    aspirin 10 meq/ 5 ml oral sol|                      aspirin 2 meq/ml oral solution|
+----+-------------------------------------------------+----------------------------------------------------+



In [6]:
# Annotator that transforms a text column from dataframe into normalized text (with all policy)

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

policy = "all"

drug_normalizer = medical.DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized") \
    .setPolicy(policy)

drug_normalizer_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    drug_normalizer
    ])

ds = drug_normalizer_pipeline.fit(data_to_normalize).transform(data_to_normalize)

ds = ds.selectExpr("document", "target_normalized_text", "explode(document_normalized.result) as all_normalized_text")
ds.show(truncate = False)

+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|document                                                                                   |target_normalized_text                              |all_normalized_text                                 |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|[{document, 0, 39, Sodium Chloride/Potassium Chloride 13bag, {sentence -> 0}, []}]         |Sodium Chloride / Potassium Chloride 13 bag         |Sodium Chloride / Potassium Chloride 13 bag         |
|[{document, 0, 48, interferon alfa-2b 10 million unit ( 1 ml ) injec, {sentence -> 0}, []}]|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|


In [7]:
# Annotator that transforms a text column from dataframe into normalized text (with abbreviations only policy)

policy = "abbreviations"

drug_normalizer_abb = medical.DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized_abbreviations") \
    .setPolicy(policy)

ds = drug_normalizer_abb.transform(ds)

ds = ds.selectExpr("document", "target_normalized_text", "all_normalized_text", "explode(document_normalized_abbreviations.result) as abbr_normalized_text")
ds.select("target_normalized_text", "all_normalized_text", "abbr_normalized_text").show(truncate=1000)

+----------------------------------------------------+----------------------------------------------------+-----------------------------------------------------+
|                              target_normalized_text|                                 all_normalized_text|                                 abbr_normalized_text|
+----------------------------------------------------+----------------------------------------------------+-----------------------------------------------------+
|         Sodium Chloride / Potassium Chloride 13 bag|         Sodium Chloride / Potassium Chloride 13 bag|             Sodium Chloride/Potassium Chloride 13bag|
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa-2b 10 million unit ( 1 ml ) injection|
|                      aspirin 2 meq/ml oral solution|                      aspirin 2 meq/ml oral solution|                   aspirin 10 meq/ 5 ml oral solution|
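+----------------------------------------------------+----------------------------------------------------+-----------------------------------------------------+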

In [8]:
# Transform a text column from dataframe into normalized text (with dosages only policy)

policy = "dosages"

# Use a dedicated variable name for the dosages-only normalizer
drug_normalizer_dos = medical.DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized_dosages") \
    .setPolicy(policy)

ds = drug_normalizer_dos.transform(ds)

ds.selectExpr("target_normalized_text", "all_normalized_text", "explode(document_normalized_dosages.result) as dos_normalized_text").show(truncate=1000)

+----------------------------------------------------+----------------------------------------------------+------------------------------------------------+
|                              target_normalized_text|                                 all_normalized_text|                             dos_normalized_text|
+----------------------------------------------------+----------------------------------------------------+------------------------------------------------+
|         Sodium Chloride / Potassium Chloride 13 bag|         Sodium Chloride / Potassium Chloride 13 bag|     Sodium Chloride / Potassium Chloride 13 bag|
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injec|
|                      aspirin 2 meq/ml oral solution|                      aspirin 2 meq/ml oral solution|                       aspirin 2 meq/ml oral sol|
+----------------------------------------------------+----------------------------------------------------+------------------------------------------------+
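
As a quick cross-check of the three result tables, the following plain-Python sketch (no Spark required) lists which policies reproduce the target text for each sample row. The strings are copied from the printed outputs above; `matching_policies` is a hypothetical helper written for this comparison.

```python
# Each row: (target, "all" output, "abbreviations" output, "dosages" output),
# copied from the tables printed by the cells above.
rows = [
    ("Sodium Chloride / Potassium Chloride 13 bag",
     "Sodium Chloride / Potassium Chloride 13 bag",
     "Sodium Chloride/Potassium Chloride 13bag",
     "Sodium Chloride / Potassium Chloride 13 bag"),
    ("interferon alfa - 2b 10000000 unt ( 1 ml ) injection",
     "interferon alfa - 2b 10000000 unt ( 1 ml ) injection",
     "interferon alfa-2b 10 million unit ( 1 ml ) injection",
     "interferon alfa - 2b 10000000 unt ( 1 ml ) injec"),
    ("aspirin 2 meq/ml oral solution",
     "aspirin 2 meq/ml oral solution",
     "aspirin 10 meq/ 5 ml oral solution",
     "aspirin 2 meq/ml oral sol"),
]

POLICIES = ("all", "abbreviations", "dosages")

def matching_policies(row):
    """Return the names of the policies whose output equals the target text."""
    target, *outputs = row
    return [name for name, out in zip(POLICIES, outputs) if out == target]

for row in rows:
    print(row[0], "->", matching_policies(row))
# Sodium Chloride / Potassium Chloride 13 bag -> ['all', 'dosages']
# interferon alfa - 2b 10000000 unt ( 1 ml ) injection -> ['all']
# aspirin 2 meq/ml oral solution -> ['all']
```

Only the first row is fully normalized by the "dosages" policy alone, because it contains no abbreviation to expand.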

#### Apply normalizer only on NER chunks

In [9]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator: splits each document into sentences
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits each sentence into tokens
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .addSplitChars(";")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Extract entities with NER model posology
posology_ner = medical.NerModel.pretrained("ner_posology_large", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_posology")

# Group extracted entities into chunks
ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_posology"])\
    .setOutputCol("ner_chunk_posology")

# Convert each extracted chunk into a document, keeping the chunk information in metadata
c2doc = nlp.Chunk2Doc()\
    .setInputCols("ner_chunk_posology")\
    .setOutputCol("chunk_doc")

# Transform a chunk document into normalized text
drug_normalizer = medical.DrugNormalizer() \
    .setInputCols("chunk_doc") \
    .setOutputCol("document_normalized_dosages")\
    .setPolicy("all")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    c2doc,
    drug_normalizer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_large download started this may take some time.
[OK!]


In [10]:
! wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed_sample_text_small.csv

In [11]:
pubMedDF = spark.read\
                .option("header", "true")\
                .csv("pubmed_sample_text_small.csv")

pubMedDF.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potas...|
|BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer...|
|OBJECTIVE: To investigate the relationship between preoperative atrialfibrillation and early and ...|
|Combined EEG/fMRI recording has been used to localize the generators of EEGevents and to identify...|
|Kohlschutter syndrome is a rare neurodegenerative disorder presenting withintractable seizures, d...|
|Statistical analysis of neuroimages is commonly approached with intergroupcomparisons made by rep...|
|The synthetic DOX-LNA conjugate was characterized by proton nuclear magn

In [12]:
result = model.transform(pubMedDF.limit(100))

In [13]:
result.show(2)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------------+
|                text|            document|            sentence|               token|          embeddings|        ner_posology|  ner_chunk_posology|           chunk_doc|document_normalized_dosages|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------------+
|The human KCNJ9 (...|[{document, 0, 95...|[{document, 0, 12...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 52, 122,...|[{document, 52, 1...|       [{document, 52, 7...|
|BACKGROUND: At pr...|[{document, 0, 14...|[{document, 0, 19...|[{token, 0, 9, BA...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 167, 180...|[{document, 167, ...|       [{document, 167, ...|
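+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------------------+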

In [14]:
import pyspark.sql.functions as F
result.select(F.explode('document_normalized_dosages.result')).show(20,truncate=150)

+-----------------------------------------------------------------------------+
|                                                                          col|
+-----------------------------------------------------------------------------+
|G - protein - activated inwardly rectifying potassium ( GIRK ) channel family|
|                                                                8 base - pair|
|                                                               anthracyclines|
|                                                                      taxanes|
|                                                                 usefulnessof|
|                                                                  vinorelbine|
|                                                                  vinorelbine|
|                                                               anthracyclines|
|                                                                      taxanes|
|                                       