![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **NerConverterInternal**

This notebook will cover the different parameters and uses of `NerConverterInternal`.

This annotator converts an IOB or IOB2 representation of named entities into a user-friendly one by grouping the tokens of each recognized entity and attaching their label. Tokens with no associated entity (tagged “O”) are filtered out.

<br/>


**📖 Learning Objectives:**

1. Understand how `NerConverterInternal` works.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [NerConverterInternal](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#nerconverterinternal)

- Python Docs : [NerConverterInternal](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/ner/ner_converter_internal/index.html#sparknlp_jsl.annotator.ner.ner_converter_internal.NerConverterInternal)

- Scala Docs : [NerConverterInternal](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/ner/NerConverterInternal.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/01.0.Clinical_Named_Entity_Recognition_Model.ipynb).


## **📜 Background**

`NerConverterInternal` is an annotator in Spark NLP that converts the token-level IOB or IOB2 tags produced by a named entity recognition (NER) model into entity chunks. It is typically placed right after an NER model in a larger pipeline so that later stages can work with whole entities instead of individual tagged tokens.

The `NerConverterInternal` annotator takes as input the `DOCUMENT`, `TOKEN`, and `NAMED_ENTITY` annotations of a dataframe, and produces as output a `CHUNK` column in which each chunk carries its text, its entity label, and a confidence score.
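As a rough illustration of the IOB-to-chunk grouping described above, here is a minimal pure-Python sketch. It is not the annotator's implementation: the function name `iob_to_chunks` is hypothetical, and joining tokens with a single space is an assumption made for simplicity.

```python
from typing import List, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB/IOB2-tagged tokens into (chunk_text, label) pairs.

    Hypothetical helper for illustration only. Tokens tagged "O" are dropped,
    and tokens are joined with a single space (an assumption of this sketch).
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # A chunk ends at an "O" tag, at a "B-" tag, or when the label changes.
        starts_new = tag == "O" or prefix == "B" or label != current_label
        if starts_new and current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if tag != "O":
            current_tokens.append(token)
            current_label = label

    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


tokens = ["prescribed", "1", "capsule", "of", "Advil", "for", "5", "days", "."]
tags = ["O", "B-DOSAGE", "B-FORM", "O", "B-DRUG",
        "B-DURATION", "I-DURATION", "I-DURATION", "O"]
example_chunks = iob_to_chunks(tokens, tags)
print(example_chunks)
```

The tokens and tags in this example are taken from the first sentence of the sample text used later in this notebook.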

## **🎬 Colab Setup**

In [1]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [3]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM
nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False 

In [4]:
import pyspark.sql.functions as F

import warnings
warnings.filterwarnings('ignore')

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()
spark

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`, `NAMED_ENTITY`

- Output: `CHUNK`

## **🔎 Parameters**

- `setThreshold`: Confidence threshold.

- `setWhiteList`: If defined, list of entities to process.

- `setBlackList`: If defined, list of entities to ignore.

- `setReplaceLabels`: If defined, contains a dictionary for entity replacement.

- `setPreservePosition`: Whether to preserve the original position of the tokens in the original document or use the modified tokens.

- `setReplaceDictResource`: If defined, path to the file containing a dictionary for entity replacement.

- `setIgnoreStopWords`: If defined, list of stop words to ignore.

- `setGreedyMode`: (Boolean) Whether to ignore B tags for contiguous tokens of the same entity.

- `setDoExceptionHandling`: (Boolean) If true, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted and processing continues with the next item. This comes with a performance penalty.







## **💻 Pipeline**

In [5]:
# Annotator that transforms a text column from a dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
embeddings  = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model
nerModel = medical.NerModel.pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# NER Converter
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology download started this may take some time.
[OK!]


In [6]:
sample_text = """The patient was prescribed 1 capsule of Advil for 5 days.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, metformin 1000 mg two times a day.
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

In [7]:
result = model.transform(data)

In [8]:
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|[O, O, O, O, B-DOSAGE, B-FORM, O, B-DRUG, B-DUR...|[1, capsule, Advil, for 5 days, 40 units, insul...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+



In [9]:
result.select(F.explode(F.arrays_zip(result.token.result,
                                              result.ner.result,
                                              result.ner.metadata)).alias("cols")) \
               .select(F.expr("cols['0']").alias("token"),
                       F.expr("cols['1']").alias("ner_label"),
                       F.expr("cols['2']['confidence']").alias("confidence"))\
               .filter("ner_label!='O'")\
               .show(30, truncate=100)

+---------+-----------+----------+
|    token|  ner_label|confidence|
+---------+-----------+----------+
|        1|   B-DOSAGE|    0.9992|
|  capsule|     B-FORM|    0.9897|
|    Advil|     B-DRUG|     0.997|
|      for| B-DURATION|    0.5002|
|        5| I-DURATION|    0.6714|
|     days| I-DURATION|    0.9699|
|       40|   B-DOSAGE|    0.9933|
|    units|   I-DOSAGE|    0.6844|
|  insulin|     B-DRUG|    0.9982|
| glargine|     I-DRUG|    0.7503|
|       at|B-FREQUENCY|    0.5213|
|    night|I-FREQUENCY|    0.9919|
|       12|   B-DOSAGE|    0.9935|
|    units|   I-DOSAGE|    0.7507|
|  insulin|     B-DRUG|    0.9993|
|   lispro|     I-DRUG|    0.5637|
|     with|B-FREQUENCY|    0.7197|
|    meals|I-FREQUENCY|    0.9938|
|metformin|     B-DRUG|    0.9995|
|     1000| B-STRENGTH|    0.9732|
|       mg| I-STRENGTH|     0.521|
|      two|B-FREQUENCY|    0.9856|
|    times|I-FREQUENCY|    0.7584|
|        a|I-FREQUENCY|    0.5888|
|      day|I-FREQUENCY|    0.9989|
+---------+-----------+----------+

In [10]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show()

+----------------+---------+----------+
|           chunk|ner_label|confidence|
+----------------+---------+----------+
|               1|   DOSAGE|    0.9992|
|         capsule|     FORM|    0.9897|
|           Advil|     DRUG|     0.997|
|      for 5 days| DURATION|0.71383333|
|        40 units|   DOSAGE|   0.83885|
|insulin glargine|     DRUG|   0.87425|
|        at night|FREQUENCY|    0.7566|
|        12 units|   DOSAGE|    0.8721|
|  insulin lispro|     DRUG|    0.7815|
|      with meals|FREQUENCY|   0.85675|
|       metformin|     DRUG|    0.9995|
|         1000 mg| STRENGTH|    0.7471|
| two times a day|FREQUENCY|0.83292496|
+----------------+---------+----------+
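The chunk confidences in the table above appear to be the arithmetic mean of the token confidences from the previous table. This is an observation from the printed numbers, not a statement taken from the documentation. A small check in plain Python, using values copied from the two tables:

```python
from statistics import mean

# Token-level confidences copied from the token table above.
token_confidences = {
    "for 5 days": [0.5002, 0.6714, 0.9699],
    "40 units": [0.9933, 0.6844],
    "insulin glargine": [0.9982, 0.7503],
    "two times a day": [0.9856, 0.7584, 0.5888, 0.9989],
}

# Chunk-level confidences copied from the chunk table above.
reported = {
    "for 5 days": 0.71383333,
    "40 units": 0.83885,
    "insulin glargine": 0.87425,
    "two times a day": 0.83292496,
}

chunk_confidence = {chunk: mean(values) for chunk, values in token_confidences.items()}

for chunk, value in chunk_confidence.items():
    print(f"{chunk}: mean={value:.6f} reported={reported[chunk]}")
```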



#### LightPipeline

[LightPipeline](https://sparknlp.org/docs/en/concepts#lightpipeline) is a Spark NLP-specific pipeline class, equivalent to a Spark ML Pipeline.

The difference is that its execution does not follow Spark principles: it computes everything locally (but in parallel) to achieve fast results when dealing with small amounts of data.

This means the input is not a Spark DataFrame but a string or an array of strings to be annotated.

In [11]:
light_model = nlp.LightPipeline(model)

light_result = light_model.annotate(sample_text)

list(zip(light_result['token'], light_result['ner']))

[('The', 'O'),
 ('patient', 'O'),
 ('was', 'O'),
 ('prescribed', 'O'),
 ('1', 'B-DOSAGE'),
 ('capsule', 'B-FORM'),
 ('of', 'O'),
 ('Advil', 'B-DRUG'),
 ('for', 'B-DURATION'),
 ('5', 'I-DURATION'),
 ('days', 'I-DURATION'),
 ('.', 'O'),
 ('He', 'O'),
 ('was', 'O'),
 ('seen', 'O'),
 ('by', 'O'),
 ('the', 'O'),
 ('endocrinology', 'O'),
 ('service', 'O'),
 ('and', 'O'),
 ('she', 'O'),
 ('was', 'O'),
 ('discharged', 'O'),
 ('on', 'O'),
 ('40', 'B-DOSAGE'),
 ('units', 'I-DOSAGE'),
 ('of', 'O'),
 ('insulin', 'B-DRUG'),
 ('glargine', 'I-DRUG'),
 ('at', 'B-FREQUENCY'),
 ('night', 'I-FREQUENCY'),
 (',', 'O'),
 ('12', 'B-DOSAGE'),
 ('units', 'I-DOSAGE'),
 ('of', 'O'),
 ('insulin', 'B-DRUG'),
 ('lispro', 'I-DRUG'),
 ('with', 'B-FREQUENCY'),
 ('meals', 'I-FREQUENCY'),
 (',', 'O'),
 ('metformin', 'B-DRUG'),
 ('1000', 'B-STRENGTH'),
 ('mg', 'I-STRENGTH'),
 ('two', 'B-FREQUENCY'),
 ('times', 'I-FREQUENCY'),
 ('a', 'I-FREQUENCY'),
 ('day', 'I-FREQUENCY'),
 ('.', 'O')]

In [12]:
light_result["ner_chunk"]

['1',
 'capsule',
 'Advil',
 'for 5 days',
 '40 units',
 'insulin glargine',
 'at night',
 '12 units',
 'insulin lispro',
 'with meals',
 'metformin',
 '1000 mg',
 'two times a day']

### `setThreshold`

This parameter may be used to define a confidence threshold to filter the chunk entities.

In [13]:
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")\
   .setThreshold(0.9)

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

model = nlpPipeline.fit(empty_data)

result = model.transform(data)

In [14]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show()

+---------+---------+----------+
|    chunk|ner_label|confidence|
+---------+---------+----------+
|        1|   DOSAGE|    0.9992|
|  capsule|     FORM|    0.9897|
|    Advil|     DRUG|     0.997|
|metformin|     DRUG|    0.9995|
+---------+---------+----------+



With the threshold set to a high value such as 0.9, only the chunks whose confidence is above it are kept, so the number of extracted entities drops from 13 to 4.
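The effect of the threshold can be reproduced on the chunk table from the unfiltered run with a plain Python filter. This is only a sketch of the idea: whether the annotator compares with `>=` or `>` is an assumption here, and none of the sample confidences is exactly 0.9, so the difference does not show up.

```python
# (chunk, label, confidence) rows copied from the unfiltered chunk table.
all_chunks = [
    ("1", "DOSAGE", 0.9992),
    ("capsule", "FORM", 0.9897),
    ("Advil", "DRUG", 0.997),
    ("for 5 days", "DURATION", 0.71383333),
    ("40 units", "DOSAGE", 0.83885),
    ("insulin glargine", "DRUG", 0.87425),
    ("at night", "FREQUENCY", 0.7566),
    ("12 units", "DOSAGE", 0.8721),
    ("insulin lispro", "DRUG", 0.7815),
    ("with meals", "FREQUENCY", 0.85675),
    ("metformin", "DRUG", 0.9995),
    ("1000 mg", "STRENGTH", 0.7471),
    ("two times a day", "FREQUENCY", 0.83292496),
]

threshold = 0.9

# Keep only chunks at or above the threshold (inclusive comparison is assumed).
confident_chunks = [row for row in all_chunks if row[2] >= threshold]

for chunk, label, confidence in confident_chunks:
    print(chunk, label, confidence)
```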

### `setWhiteList`

`setWhiteList` gives the option to define a list of entities to process.

In [15]:
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")\
   .setWhiteList(["DRUG"])

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

model = nlpPipeline.fit(empty_data)

result = model.transform(data)

In [16]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show()

+----------------+---------+----------+
|           chunk|ner_label|confidence|
+----------------+---------+----------+
|           Advil|     DRUG|     0.997|
|insulin glargine|     DRUG|   0.87425|
|  insulin lispro|     DRUG|    0.7815|
|       metformin|     DRUG|    0.9995|
+----------------+---------+----------+



Using the `setWhiteList` parameter provided only the chunks of interest, not all the extracted entities.

### `setBlackList`

`setBlackList` gives the option to define a list of entities **not** to process.

In [17]:
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")\
   .setBlackList(["FREQUENCY"])

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

model = nlpPipeline.fit(empty_data)

result = model.transform(data)

In [18]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show()

+----------------+---------+----------+
|           chunk|ner_label|confidence|
+----------------+---------+----------+
|               1|   DOSAGE|    0.9992|
|         capsule|     FORM|    0.9897|
|           Advil|     DRUG|     0.997|
|      for 5 days| DURATION|0.71383333|
|        40 units|   DOSAGE|   0.83885|
|insulin glargine|     DRUG|   0.87425|
|        12 units|   DOSAGE|    0.8721|
|  insulin lispro|     DRUG|    0.7815|
|       metformin|     DRUG|    0.9995|
|         1000 mg| STRENGTH|    0.7471|
+----------------+---------+----------+



Using the `setBlackList` parameter produced all the chunks except the ones defined in the list.
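The whitelist and blacklist behavior shown in the last two sections amounts to filtering chunks by label. A minimal sketch in plain Python, where `filter_chunks` is a hypothetical helper and not part of the library:

```python
from typing import List, Optional, Sequence, Tuple

Chunk = Tuple[str, str]  # (chunk text, entity label)


def filter_chunks(chunks: Sequence[Chunk],
                  white_list: Optional[Sequence[str]] = None,
                  black_list: Optional[Sequence[str]] = None) -> List[Chunk]:
    """Hypothetical helper: keep whitelisted labels and drop blacklisted ones."""
    kept = []
    for text, label in chunks:
        if white_list is not None and label not in white_list:
            continue  # a whitelist is defined and this label is not on it
        if black_list is not None and label in black_list:
            continue  # this label is explicitly excluded
        kept.append((text, label))
    return kept


# Chunks and labels copied from the unfiltered chunk table.
chunks = [
    ("1", "DOSAGE"), ("capsule", "FORM"), ("Advil", "DRUG"),
    ("for 5 days", "DURATION"), ("40 units", "DOSAGE"),
    ("insulin glargine", "DRUG"), ("at night", "FREQUENCY"),
    ("12 units", "DOSAGE"), ("insulin lispro", "DRUG"),
    ("with meals", "FREQUENCY"), ("metformin", "DRUG"),
    ("1000 mg", "STRENGTH"), ("two times a day", "FREQUENCY"),
]

drug_only = filter_chunks(chunks, white_list=["DRUG"])
no_frequency = filter_chunks(chunks, black_list=["FREQUENCY"])
print(drug_only)
print(no_frequency)
```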

### `setReplaceLabels`

This parameter takes a dictionary that maps the original entity labels to their new values.

In [19]:
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")\
   .setReplaceLabels({"DRUG": "Drug_BrandName",
                       "FREQUENCY": "Drug_Frequency",
                       "DOSAGE": "Drug_Dosage",
                       "STRENGTH": "Drug_Strength",
                       "FORM": "Drug_Form",
                       "DURATION": "Drug_Duration"})

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

model = nlpPipeline.fit(empty_data)

result = model.transform(data)

In [20]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show()

+----------------+--------------+----------+
|           chunk|     ner_label|confidence|
+----------------+--------------+----------+
|               1|   Drug_Dosage|    0.9992|
|         capsule|     Drug_Form|    0.9897|
|           Advil|Drug_BrandName|     0.997|
|      for 5 days| Drug_Duration|0.71383333|
|        40 units|   Drug_Dosage|   0.83885|
|insulin glargine|Drug_BrandName|   0.87425|
|        at night|Drug_Frequency|    0.7566|
|        12 units|   Drug_Dosage|    0.8721|
|  insulin lispro|Drug_BrandName|    0.7815|
|      with meals|Drug_Frequency|   0.85675|
|       metformin|Drug_BrandName|    0.9995|
|         1000 mg| Drug_Strength|    0.7471|
| two times a day|Drug_Frequency|0.83292496|
+----------------+--------------+----------+
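Conceptually, the replacement is a dictionary lookup on each chunk's label. The sketch below assumes that labels missing from the dictionary are left unchanged. The example above does not show that case, so treat it as an assumption. The helper name `relabel` is hypothetical.

```python
replace_labels = {
    "DRUG": "Drug_BrandName",
    "FREQUENCY": "Drug_Frequency",
    "DOSAGE": "Drug_Dosage",
    "STRENGTH": "Drug_Strength",
    "FORM": "Drug_Form",
    "DURATION": "Drug_Duration",
}


def relabel(label: str, mapping: dict) -> str:
    """Hypothetical helper: return the mapped label, or the original if unmapped."""
    return mapping.get(label, label)


# A few (chunk, label) rows copied from the earlier unmodified output.
sample_chunks = [("Advil", "DRUG"), ("1000 mg", "STRENGTH"), ("for 5 days", "DURATION")]
relabeled = [(text, relabel(label, replace_labels)) for text, label in sample_chunks]
print(relabeled)
```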



### `setPreservePosition`

This parameter is used to decide whether to preserve the original positions of the tokens in the original text or use the modified tokens.



In [21]:
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")\
   .setPreservePosition(True)

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

model = nlpPipeline.fit(empty_data)

result = model.transform(data)

In [22]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show()

+----------------+---------+----------+
|           chunk|ner_label|confidence|
+----------------+---------+----------+
|               1|   DOSAGE|    0.9992|
|         capsule|     FORM|    0.9897|
|           Advil|     DRUG|     0.997|
|      for 5 days| DURATION|0.71383333|
|        40 units|   DOSAGE|   0.83885|
|insulin glargine|     DRUG|   0.87425|
|        at night|FREQUENCY|    0.7566|
|        12 units|   DOSAGE|    0.8721|
|  insulin lispro|     DRUG|    0.7815|
|      with meals|FREQUENCY|   0.85675|
|       metformin|     DRUG|    0.9995|
|         1000 mg| STRENGTH|    0.7471|
| two times a day|FREQUENCY|0.83292496|
+----------------+---------+----------+



Here the parameter is set to `True`, so the chunks keep the positions of the tokens in the original document. The labels are the original ones because this converter is defined without `setReplaceLabels`.

### `setReplaceDictResource`

This parameter is used to define the path to the file containing a dictionary for entity replacement.

In [23]:
dictionary = """Old_Label, New_Label
DRUG, Drug_BrandName
FREQUENCY,Drug_Frequency
DOSAGE,Drug_Dosage
STRENGTH,Drug_Strength
FORM, Drug_Form
DURATION, Drug_Duration
"""
with open('dictionary.csv', 'w') as f:
    f.write(dictionary)
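Some lines of the dictionary above have a space after the comma and others do not. The `ner_label` column in the output further down is one character wider than in the `setReplaceLabels` run, which suggests that the space may end up inside the new label. One way to avoid stray whitespace is to generate the file with Python's `csv` module. This is a sketch under that assumption, and the file name `dictionary_clean.csv` is hypothetical.

```python
import csv

label_map = {
    "DRUG": "Drug_BrandName",
    "FREQUENCY": "Drug_Frequency",
    "DOSAGE": "Drug_Dosage",
    "STRENGTH": "Drug_Strength",
    "FORM": "Drug_Form",
    "DURATION": "Drug_Duration",
}

clean_path = "dictionary_clean.csv"  # hypothetical file name

with open(clean_path, "w", newline="") as f:
    # lineterminator="\n" avoids the default "\r\n", which could leave a
    # carriage return at the end of each replacement label.
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    writer.writerow(["Old_Label", "New_Label"])  # header row, mirroring the cell above
    writer.writerows(label_map.items())

with open(clean_path) as f:
    print(f.read())
```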

In [24]:
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")\
   .setReplaceDictResource("/content/dictionary.csv","text", {"delimiter":","})

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

model = nlpPipeline.fit(empty_data)

result = model.transform(data)

In [25]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show()

+----------------+---------------+----------+
|           chunk|      ner_label|confidence|
+----------------+---------------+----------+
|               1|    Drug_Dosage|    0.9992|
|         capsule|      Drug_Form|    0.9897|
|           Advil| Drug_BrandName|     0.997|
|      for 5 days|  Drug_Duration|0.71383333|
|        40 units|    Drug_Dosage|   0.83885|
|insulin glargine| Drug_BrandName|   0.87425|
|        at night| Drug_Frequency|    0.7566|
|        12 units|    Drug_Dosage|    0.8721|
|  insulin lispro| Drug_BrandName|    0.7815|
|      with meals| Drug_Frequency|   0.85675|
|       metformin| Drug_BrandName|    0.9995|
|         1000 mg|  Drug_Strength|    0.7471|
| two times a day| Drug_Frequency|0.83292496|
+----------------+---------------+----------+



### `setIgnoreStopWords`

This parameter can be used to define a list of stop words to ignore.

It should be a list of tokens/words or characters. When two entities of the same type are separated only by those words, the entities can be combined to produce a single, larger chunk.

First, let us create a pipeline (this time using a deidentification model) without the `setIgnoreStopWords` parameter and visualize the results.

In [26]:
nerModel = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_deid")

nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner_deid"]) \
   .setOutputCol("chunk_deid")\
   .setGreedyMode(True)\
   .setWhiteList(['LOCATION'])

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

ner_converter_model = nlpPipeline.fit(empty_data)

ner_deid_generic_augmented download started this may take some time.
[OK!]


In [27]:
from sparknlp_display import NerVisualizer

text = """
The address of the manufacturer:
R K Industry House, Walbhat Rd
Mumbai, Maharashtra, India
"""

lmodel= nlp.LightPipeline(ner_converter_model)
res = lmodel.fullAnnotate(text)[0]

NerVisualizer().display(res, 'chunk_deid')

Now, let's define some characters and words with the `setIgnoreStopWords()` parameter and see the difference between the chunks.

In [28]:
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner_deid"]) \
   .setOutputCol("chunk_deid")\
   .setGreedyMode(True)\
   .setWhiteList(['LOCATION'])\
   .setIgnoreStopWords(['\n', ',', "and", 'or', '.'])

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

ner_converter_model = nlpPipeline.fit(empty_data)

In [29]:
lmodel= nlp.LightPipeline(ner_converter_model)
res = lmodel.fullAnnotate(text)[0]

NerVisualizer().display(res, 'chunk_deid')

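With the stop words ignored, `LOCATION` entities that are separated only by those tokens can be merged into a single, larger chunk, so the chunks differ considerably from the previous run.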
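The merging idea described in this section can be sketched in plain Python. The labels below are hand-written for illustration and are not the output of `ner_deid_generic_augmented`. The helper `merge_across_stop_words` is hypothetical, and joining tokens with spaces is a simplification.

```python
from typing import List, Set, Tuple


def merge_across_stop_words(tagged: List[Tuple[str, str]],
                            stop_words: Set[str]) -> List[Tuple[str, str]]:
    """Hypothetical helper: merge same-label entities separated only by stop words.

    `tagged` holds (token, label) pairs, with "O" for non-entity tokens.
    """
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []   # tokens of the chunk being built
    label = None              # label of the chunk being built
    pending: List[str] = []   # stop words seen since the last entity token

    for token, tag in tagged:
        if tag != "O":
            if current and tag == label:
                # Same entity type: absorb any stop words in between.
                current.extend(pending + [token])
            else:
                if current:
                    chunks.append((" ".join(current), label))
                current, label = [token], tag
            pending = []
        elif current and token in stop_words:
            pending.append(token)
        else:
            # A non-entity token that is not a stop word closes the chunk.
            if current:
                chunks.append((" ".join(current), label))
            current, label, pending = [], None, []

    if current:
        chunks.append((" ".join(current), label))
    return chunks


# Hand-written labels for illustration only (not model output).
tagged = [("Mumbai", "LOCATION"), (",", "O"), ("Maharashtra", "LOCATION"),
          (",", "O"), ("India", "LOCATION")]

without_stop_words = merge_across_stop_words(tagged, set())
with_stop_words = merge_across_stop_words(tagged, {","})
print(without_stop_words)
print(with_stop_words)
```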