![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/IOBTagger.ipynb)

# **IOBTagger**
Merges token tags and NER labels from chunks in the specified format. For example, the output columns of `Tokenizer` and `NerConverter` can be used as its inputs.

This notebook will cover the different parameters and usages of `IOBTagger`.

**📖 Learning Objectives:**

 Become comfortable using the different parameters of the `IOBTagger`.


**🔗 Helpful Links:**

- Documentation : [IOBTagger](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#iobtagger)

- Python Docs : [IOBTagger](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/ner/iob_tagger/index.html#sparknlp_jsl.annotator.ner.iob_tagger.IOBTagger)

- Scala Docs : [IOBTagger](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/ner/IOBTagger.html)

- For extended examples of usage, see the [Spark Healthcare NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

## **📜 Background**


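`IOBTagger` merges token tags and NER labels from chunks in the specified format. It emits one tag per token: a token inside a chunk receives that chunk's entity label with a positional prefix (for example `B-DRUG`, `I-DRUG`), and a token outside every chunk receives `O`.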
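To make "the specified format" concrete, here is a minimal plain-Python sketch of two common tagging conventions, IOB and BIOES. It does not call Spark NLP, and the helper name `tag_chunk` is hypothetical; check the documentation linked above for the formats the annotator itself accepts.

```python
def tag_chunk(n_tokens, label, scheme="IOB"):
    """Return the per-token tags for one chunk of `n_tokens` tokens.

    Hypothetical illustration helper, not part of Spark NLP.
    IOB:   B- on the first token, I- on the rest.
    BIOES: S- for single-token chunks, otherwise B- ... I- ... E-.
    """
    if n_tokens < 1:
        raise ValueError("a chunk must contain at least one token")
    if scheme == "IOB":
        return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 1)
    if scheme == "BIOES":
        if n_tokens == 1:
            return [f"S-{label}"]
        return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 2) + [f"E-{label}"]
    raise ValueError(f"unknown scheme: {scheme}")


# "1 capsule of Advil" is a four-token DRUG chunk
print(tag_chunk(4, "DRUG"))             # IOB tags
print(tag_chunk(3, "Duration", "BIOES"))  # BIOES tags
```

Tokens that fall outside every chunk are tagged `O` in both conventions.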

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_10494.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==6.1.3, 💊Spark-Healthcare==6.1.1, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN` , `CHUNK`

- Output: `NAMED_ENTITY`
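The relationship between the two inputs and the output can be sketched in plain Python. This is an illustration of the idea only, not the annotator's implementation: tokens and chunks are represented as tuples with inclusive `begin`/`end` character offsets, and the helper names `tokenize_with_offsets` and `iob_tags` are hypothetical. The chunk offsets are taken from the first sentence of the example text used later in this notebook.

```python
import re


def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (begin, end, token) with inclusive end offsets."""
    return [(m.start(), m.end() - 1, m.group()) for m in re.finditer(r"\S+", text)]


def iob_tags(tokens, chunks):
    """Assign one IOB tag per token from (begin, end, label) chunk spans."""
    tagged = []
    for begin, end, word in tokens:
        tag = "O"  # default for tokens outside every chunk
        for c_begin, c_end, label in chunks:
            if c_begin <= begin and end <= c_end:
                # First token of the chunk gets B-, the following ones get I-
                tag = ("B-" if begin == c_begin else "I-") + label
                break
        tagged.append((word, tag))
    return tagged


text = "The patient was prescribed 1 capsule of Advil for 5 days ."
chunks = [(27, 44, "DRUG"), (46, 55, "Duration")]  # CHUNK-like spans
tagged = iob_tags(tokenize_with_offsets(text), chunks)

for word, tag in tagged:
    print(f"{word:>12}  {tag}")
```

Each `(word, tag)` pair plays the role of one `NAMED_ENTITY` annotation: there are exactly as many output tags as input tokens.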

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")\
    .setCustomBounds(['\n'])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# The general clinical NER model `ner_jsl` is loaded here (the variable names say "posology" for historical reasons)
posology_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_posology")

ner_posology_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_posology"])\
    .setOutputCol("ner_jsl")

posology_greedy_ner = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_posology_greedy")

ner_posology_greedy_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_posology_greedy"])\
    .setOutputCol("ner_posology_greedy")

chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("ner_jsl", "ner_posology_greedy")\
    .setOutputCol('merged_ner_chunk')

iobTagger = medical.IOBTagger() \
    .setInputCols(["token", "merged_ner_chunk"]) \
    .setOutputCol("ner_label")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_posology_converter,
    posology_greedy_ner,
    ner_posology_greedy_converter,
    chunk_merger,
    iobTagger ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
ner_posology_greedy download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


In [None]:
text = 'The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

filter_df = spark.createDataFrame([[text]]).toDF("text")

result = chunk_filter_model.transform(filter_df)

In [None]:
import pyspark.sql.functions as F

In [None]:
# Zip tokens, IOB tags, and tag metadata position by position, then emit one row per token
result_df = result.select(
        F.explode(
            F.arrays_zip(result.token.result, result.ner_label.result, result.ner_label.metadata)
        ).alias("cols"))\
    .select(F.expr("cols['0']").alias("token"),
            F.expr("cols['1']").alias("ner_label"),
            F.expr("cols['2']['ner_source']").alias("ner_source"))


result_df.show(50, truncate=100)

+-------------+---------------------+-------------------+
|        token|            ner_label|         ner_source|
+-------------+---------------------+-------------------+
|          The|                    O|                   |
|      patient|                    O|                   |
|          was|                    O|                   |
|   prescribed|                    O|                   |
|            1|               B-DRUG|ner_posology_greedy|
|      capsule|               I-DRUG|ner_posology_greedy|
|           of|               I-DRUG|ner_posology_greedy|
|        Advil|               I-DRUG|ner_posology_greedy|
|          for|           B-Duration|            ner_jsl|
|            5|           I-Duration|            ner_jsl|
|         days|           I-Duration|            ner_jsl|
|            .|                    O|                   |
|           He|             B-Gender|            ner_jsl|
|          was|                    O|                   |
|         seen

In [None]:
result.select("ner_label.result").show(truncate=False)

+--------------------...
|result              ...

In [None]:
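# One row per tagged token, keeping only tokens that belong to an entity (tag other than 'O')
result.selectExpr("explode(ner_label) as a")\
    .selectExpr("a.begin", "a.end", "a.result as chunk", "a.metadata.word as word")\
    .where("chunk != 'O'")\
    .show(truncate=False)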

+-----+---+---------------------+-------------+
|begin|end|chunk                |word         |
+-----+---+---------------------+-------------+
|27   |27 |B-DRUG               |1            |
|29   |35 |I-DRUG               |capsule      |
|37   |38 |I-DRUG               |of           |
|40   |44 |I-DRUG               |Advil        |
|46   |48 |B-Duration           |for          |
|50   |50 |I-Duration           |5            |
|52   |55 |I-Duration           |days         |
|59   |60 |B-Gender             |He           |
|78   |90 |B-Clinical_Dept      |endocrinology|
|92   |98 |I-Clinical_Dept      |service      |
|104  |106|B-Gender             |she          |
|112  |121|B-Admission_Discharge|discharged   |
|126  |127|B-DRUG               |40           |
|129  |133|I-DRUG               |units        |
|135  |136|I-DRUG               |of           |
|138  |144|I-DRUG               |insulin      |
|146  |153|I-DRUG               |glargine     |
|155  |156|B-Frequency          |at     
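
Per-token rows like the ones above can be collapsed back into entity spans. The sketch below does this in plain Python on the first eight rows of that table; `group_iob` is a hypothetical helper written for illustration, not a Spark NLP function.

```python
def group_iob(rows):
    """Collapse (begin, end, tag, word) rows into (begin, end, label, text) chunks."""
    chunks, current = [], None
    for begin, end, tag, word in rows:
        prefix, _, label = tag.partition("-")
        # An I- tag extends the open chunk only if the label matches
        if prefix == "I" and current is not None and current["label"] == label:
            current["end"] = end
            current["words"].append(word)
            continue
        # Anything else closes the open chunk (if any)
        if current is not None:
            chunks.append(current)
        # 'O' opens nothing; B- (or a stray I-) opens a new chunk
        current = None if tag == "O" else {
            "begin": begin, "end": end, "label": label, "words": [word]
        }
    if current is not None:
        chunks.append(current)
    return [(c["begin"], c["end"], c["label"], " ".join(c["words"])) for c in chunks]


rows = [
    (27, 27, "B-DRUG", "1"),
    (29, 35, "I-DRUG", "capsule"),
    (37, 38, "I-DRUG", "of"),
    (40, 44, "I-DRUG", "Advil"),
    (46, 48, "B-Duration", "for"),
    (50, 50, "I-Duration", "5"),
    (52, 55, "I-Duration", "days"),
    (59, 60, "B-Gender", "He"),
]

for chunk in group_iob(rows):
    print(chunk)
```

Joining words with a single space reproduces the original text here only because the sample sentence is whitespace-tokenized; slicing the source text with the `begin` and `end` offsets is the safer way to recover the exact chunk string.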