![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **IOBTagger**
Merges tokens and NER chunks into token-level NER labels in the specified tag format. For example, the output columns of `Tokenizer` and `NerConverter` can be used as its inputs.

This notebook will cover the different parameters and usages of `IOBTagger`.

**📖 Learning Objectives:**

- Become comfortable using the different parameters of the `IOBTagger`.


**🔗 Helpful Links:**

- Documentation : [IOBTagger](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#iobtagger)

- Python Docs : [IOBTagger](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/ner/iob_tagger/index.html#sparknlp_jsl.annotator.ner.iob_tagger.IOBTagger)

- Scala Docs : [IOBTagger](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/ner/IOBTagger.html)

- For extended examples of usage, see the [Spark Healthcare NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

## **📜 Background**


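`IOBTagger` takes the tokens of a text and the NER chunks found in it, and assigns a label to every token: `B-<entity>` for the first token of a chunk, `I-<entity>` for the remaining tokens of the same chunk, and an outside label for tokens that belong to no chunk.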
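To make the merge concrete, here is a minimal plain-Python sketch of the idea. It is not the library's implementation: the function `iob_tags` and the tuple layouts are hypothetical, and the sketch assumes inclusive character offsets and `O` as the outside label. The offsets are taken from the sample sentence used later in this notebook.

```python
from typing import List, Tuple

Token = Tuple[int, int, str]  # (begin, end, text), end offset inclusive
Chunk = Tuple[int, int, str]  # (begin, end, entity label), end offset inclusive


def iob_tags(tokens: List[Token], chunks: List[Chunk], outside: str = "O") -> List[str]:
    """Assign one IOB tag per token, based on the chunk that covers it (hypothetical helper)."""
    tags = []
    for t_begin, t_end, _ in tokens:
        tag = outside  # default: the token is not part of any chunk
        for c_begin, c_end, label in chunks:
            if c_begin <= t_begin and t_end <= c_end:
                # The first token of a chunk gets B-, every later token gets I-.
                prefix = "B" if t_begin == c_begin else "I"
                tag = f"{prefix}-{label}"
                break
        tags.append(tag)
    return tags


tokens = [(27, 27, "1"), (29, 35, "capsule"), (37, 38, "of"), (40, 44, "Advil")]
chunks = [(27, 35, "Dosage"), (40, 44, "Drug_BrandName")]

tags = iob_tags(tokens, chunks)
for (_, _, word), tag in zip(tokens, tags):
    print(f"{word:10s} {tag}")
```

The annotator itself works on Spark NLP `TOKEN` and `CHUNK` annotations rather than tuples, but the begin/end comparison above is the core of what "merging" means here.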

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from johnsnowlabs import nlp


nlp.install(force_browser=True)

In [None]:
from johnsnowlabs import nlp, medical

spark = nlp.start()

📋 Loading license number 0 from /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.0.0, 💊Spark-Healthcare==5.0.0, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN` , `CHUNK`

- Output: `NAMED_ENTITY`

In [None]:
# Minimal pipeline: Tokenizer and NerConverter outputs are the inputs of IOBTagger
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
documentAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
# Embeddings need the sentence and token columns as inputs
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embs")
nerModel = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models").setInputCols(["sentence", "token", "embs"]).setOutputCol("ner")
nerConverter = nlp.NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

# IOBTagger consumes TOKEN and CHUNK annotations and emits NAMED_ENTITY labels
iobTagger = medical.IOBTagger().setInputCols(["token", "ner_chunk"]).setOutputCol("ner_label")
pipeline = nlp.Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, embeddings, nerModel, nerConverter, iobTagger])

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")\
    .setCustomBounds(['\n'])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# The ner_jsl clinical NER model is used
posology_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

iobTagger = medical.IOBTagger() \
    .setInputCols(["token", "ner_chunk"]) \
    .setOutputCol("ner_label")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    iobTagger ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]


In [None]:
text = 'The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

filter_df = spark.createDataFrame([[text]]).toDF("text")

result = chunk_filter_model.transform(filter_df)

In [None]:
import pyspark.sql.functions as F

In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

+-------------+---------------------+----------+
|        token|            ner_label|confidence|
+-------------+---------------------+----------+
|          The|                    O|    0.9997|
|      patient|                    O|    0.9818|
|          was|                    O|    0.9739|
|   prescribed|                    O|    0.8993|
|            1|             B-Dosage|    0.9827|
|      capsule|             I-Dosage|     0.268|
|           of|                    O|    0.8908|
|        Advil|     B-Drug_BrandName|    0.9617|
|          for|           B-Duration|    0.9878|
|            5|           I-Duration|    0.9064|
|         days|           I-Duration|    0.9584|
|            .|                    O|    0.9941|
|           He|             B-Gender|       1.0|
|          was|                    O|      0.99|
|         seen|                    O|    0.9609|
|           by|                    O|    0.9767|
|          the|                    O|     0.869|
|endocrinology|     

In [None]:
result.select("ner_chunk.result").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1 capsule, Advil, for 5 days, He, endocrinology service, she, discharged, 40 units, insulin glargine, at night, 12 units, insulin lispro, with meals, metformin, 1000 mg, two times a day, SGLT2 inhibitors, fro 3 months]|
+---------------------------------------------------------------------------------------------------------------

In [None]:
result.selectExpr("explode(ner_label) as a").selectExpr("a.begin","a.end","a.result as chunk","a.metadata.word as word").where("chunk!='O'").show(truncate=False)

+-----+---+----------------+-------------+
|begin|end|chunk           |word         |
+-----+---+----------------+-------------+
|0    |2  |0               |The          |
|4    |10 |0               |patient      |
|12   |14 |0               |was          |
|16   |25 |0               |prescribed   |
|27   |27 |B-Dosage        |1            |
|29   |35 |I-Dosage        |capsule      |
|37   |38 |0               |of           |
|40   |44 |B-Drug_BrandName|Advil        |
|46   |48 |B-Duration      |for          |
|50   |50 |I-Duration      |5            |
|52   |55 |I-Duration      |days         |
|57   |57 |0               |.            |
|59   |60 |B-Gender        |He           |
|62   |64 |0               |was          |
|66   |69 |0               |seen         |
|71   |72 |0               |by           |
|74   |76 |0               |the          |
|78   |90 |B-Clinical_Dept |endocrinology|
|92   |98 |I-Clinical_Dept |service      |
|100  |102|0               |and          |
+-----+---+----------------+-------------+
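Note that the rows above still include tokens whose tag is displayed as `0`, so the `chunk!='O'` condition did not remove them in this run. The sketch below shows, in plain Python on a few rows copied from the table, one way to drop outside tokens whether the label is the letter `O` or the digit `0`. The names `OUTSIDE_LABELS`, `is_entity_tag` and `keep_entity_rows` are hypothetical, and treating both spellings as "outside" is an assumption based only on the output shown here.

```python
from typing import List, Tuple

Row = Tuple[int, int, str, str]  # (begin, end, tag, word)

# Assumption: both spellings mark a token that is outside any entity.
OUTSIDE_LABELS = {"O", "0"}


def is_entity_tag(tag: str) -> bool:
    """Return True for B-/I- style tags, False for outside labels (hypothetical helper)."""
    return tag not in OUTSIDE_LABELS


def keep_entity_rows(rows: List[Row]) -> List[Row]:
    """Keep only the rows whose tag belongs to an entity."""
    return [row for row in rows if is_entity_tag(row[2])]


# Rows copied from the table above.
rows = [
    (16, 25, "0", "prescribed"),
    (27, 27, "B-Dosage", "1"),
    (29, 35, "I-Dosage", "capsule"),
    (37, 38, "0", "of"),
    (40, 44, "B-Drug_BrandName", "Advil"),
]

entity_rows = keep_entity_rows(rows)
for row in entity_rows:
    print(row)
```

On the Spark DataFrame, the same idea could be expressed by changing the filter to `.where("chunk NOT IN ('O', '0')")`.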