![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/IOBTagger.ipynb)

# **IOBTagger**
`IOBTagger` merges tokens and NER chunks into token-level labels in a specified tagging format, such as IOB. For example, the output columns of `Tokenizer` and `NerConverter` can be used as its inputs.

This notebook will cover the different parameters and usages of `IOBTagger`.

**📖 Learning Objectives:**

- Become comfortable using the different parameters of the `IOBTagger`.


**🔗 Helpful Links:**

- Documentation : [IOBTagger](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#iobtagger)

- Python Docs : [IOBTagger](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/ner/iob_tagger/index.html#sparknlp_jsl.annotator.ner.iob_tagger.IOBTagger)

- Scala Docs : [IOBTagger](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/ner/IOBTagger.html)

- For extended examples of usage, see the [Spark Healthcare NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

## **📜 Background**


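`IOBTagger` takes the tokens of a document together with its NER chunks and assigns a tag to every token: `B-<label>` for the first token of a chunk, `I-<label>` for the remaining tokens of that chunk, and `O` for tokens that fall outside any chunk.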
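
To make the merge concrete, here is a minimal plain-Python sketch of the idea. It is not the annotator's actual implementation: the function name `chunks_to_iob` and the tuple layout are assumptions made for illustration only. Character offsets are inclusive, as in Spark NLP annotations, and are taken from the example sentence used later in this notebook.

```python
from typing import List, Tuple

Token = Tuple[int, int, str]  # (begin, end, text), end offset inclusive
Chunk = Tuple[int, int, str]  # (begin, end, entity label), end offset inclusive


def chunks_to_iob(tokens: List[Token], chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Hypothetical helper: tag each token B-<label>, I-<label>, or O from chunk spans."""
    tagged = []
    for begin, end, text in tokens:
        tag = "O"  # default for tokens outside every chunk
        for c_begin, c_end, label in chunks:
            if c_begin <= begin and end <= c_end:
                # The first token of a chunk gets B-, later tokens get I-
                prefix = "B" if begin == c_begin else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((text, tag))
    return tagged


tokens = [(16, 25, "prescribed"), (27, 27, "1"), (29, 35, "capsule"),
          (37, 38, "of"), (40, 44, "Advil")]
chunks = [(27, 35, "Dosage"), (40, 44, "Drug_BrandName")]

for text, tag in chunks_to_iob(tokens, chunks):
    print(f"{text:<12}{tag}")
```

The sketch assumes chunks do not overlap. Tokens are matched to chunks purely by character offsets, which is why the tagger needs both a `TOKEN` and a `CHUNK` input column.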

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars the user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN` , `CHUNK`

- Output: `NAMED_ENTITY`
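
The `NAMED_ENTITY` output carries one tag per token. The documentation linked above appears to describe two tag formats, IOB and BIOES; the exact parameter for switching between them is not shown in this notebook, so it is not used here. As a rough illustration of how the two formats differ, the following plain-Python sketch converts a list of IOB tags to BIOES. The function name `iob_to_bioes` is hypothetical and the tags are illustrative.

```python
from typing import List


def iob_to_bioes(tags: List[str]) -> List[str]:
    """Hypothetical helper: single-token entities become S-, final tokens become E-."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


iob = ["O", "B-Dosage", "I-Dosage", "O",
       "B-Drug_BrandName", "B-Duration", "I-Duration", "I-Duration"]
print(iob_to_bioes(iob))
```

In BIOES, `S-` marks an entity made of a single token and `E-` marks the last token of a multi-token entity, so entity boundaries are explicit on both sides.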

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")\
    .setCustomBounds(['\n'])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# The general clinical NER model `ner_jsl` is used (the variable keeps the name `posology_ner`)
posology_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

iobTagger = medical.IOBTagger() \
    .setInputCols(["token", "ner_chunk"]) \
    .setOutputCol("ner_label")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    iobTagger ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]


In [None]:
text = 'The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

filter_df = spark.createDataFrame([[text]]).toDF("text")

result = chunk_filter_model.transform(filter_df)

In [None]:
import pyspark.sql.functions as F

In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

+-------------+---------------------+----------+
|        token|            ner_label|confidence|
+-------------+---------------------+----------+
|          The|                    O|    0.9997|
|      patient|                    O|    0.9818|
|          was|                    O|    0.9739|
|   prescribed|                    O|    0.8993|
|            1|             B-Dosage|    0.9827|
|      capsule|             I-Dosage|     0.268|
|           of|                    O|    0.8908|
|        Advil|     B-Drug_BrandName|    0.9617|
|          for|           B-Duration|    0.9878|
|            5|           I-Duration|    0.9064|
|         days|           I-Duration|    0.9584|
|            .|                    O|    0.9941|
|           He|             B-Gender|       1.0|
|          was|                    O|      0.99|
|         seen|                    O|    0.9609|
|           by|                    O|    0.9767|
|          the|                    O|     0.869|
|endocrinology|     

In [None]:
result.select("ner_chunk.result").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1 capsule, Advil, for 5 days, He, endocrinology service, she, discharged, 40 units, insulin glargine, at night, 12 units, insulin lispro, with meals, metformin, 1000 mg, two times a day, SGLT2 inhibitors, fro 3 months]|
+---------------------------------------------------------------------------------------------------------------

In [None]:
result.selectExpr("explode(ner_label) as a").selectExpr("a.begin","a.end","a.result as chunk","a.metadata.word as word").where("chunk!='O'").show(truncate=False)

+-----+---+---------------------+-------------+
|begin|end|chunk                |word         |
+-----+---+---------------------+-------------+
|27   |27 |B-Dosage             |1            |
|29   |35 |I-Dosage             |capsule      |
|40   |44 |B-Drug_BrandName     |Advil        |
|46   |48 |B-Duration           |for          |
|50   |50 |I-Duration           |5            |
|52   |55 |I-Duration           |days         |
|59   |60 |B-Gender             |He           |
|78   |90 |B-Clinical_Dept      |endocrinology|
|92   |98 |I-Clinical_Dept      |service      |
|104  |106|B-Gender             |she          |
|112  |121|B-Admission_Discharge|discharged   |
|126  |127|B-Dosage             |40           |
|129  |133|I-Dosage             |units        |
|138  |144|B-Drug_Ingredient    |insulin      |
|146  |153|I-Drug_Ingredient    |glargine     |
|155  |156|B-Frequency          |at           |
|158  |162|I-Frequency          |night        |
|166  |167|B-Dosage             |12     