![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/40.Rule_Based_Entity_Matchers.ipynb)

## **🎬 Colab Setup**

In [1]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [2]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install --upgrade -q spark-nlp-display

In [3]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(secret = license_keys["SECRET"], params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.3.0
Spark NLP_JSL Version : 5.3.0


#   **📜 RegexMatcherInternal**

The **`RegexMatcherInternal`** annotator matches a set of regular expressions against the text and tags each match with the entity assigned to its rule. It is useful for entities that follow a predictable surface pattern, such as dates or IP addresses.

Each rule pairs a regular expression with an entity label. Rules can be set directly with the `setRules` method, together with a delimiter, or loaded from an external file with the `setExternalRules` method.

The matching strategy (`MATCH_FIRST`, `MATCH_ALL`, or `MATCH_COMPLETE`) controls how matches are handled. The annotator takes `DOCUMENT` annotations as input and produces `CHUNK` annotations.


A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be `"\\d{4}\\/\\d\\d\\/\\d\\d,date"` which will match strings like `"1970/01/01"` to the identifier `"date"`.
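To make the rule format concrete, the sketch below splits a rule string into its pattern and identifier and applies it with Python's built-in `re` module. It does not call Spark NLP: `parse_rule` is a hypothetical helper, and splitting at the last delimiter is an assumption of this sketch rather than documented annotator behavior.

```python
import re

def parse_rule(rule, delimiter=","):
    """Split '<regex><delimiter><identifier>' at the LAST delimiter (hypothetical helper)."""
    pattern, identifier = rule.rsplit(delimiter, 1)
    return re.compile(pattern), identifier

# The rule from the paragraph above; backslashes are doubled in a normal Python string.
pattern, identifier = parse_rule("\\d{4}\\/\\d\\d\\/\\d\\d,date")

matches = [(m.group(0), identifier) for m in pattern.finditer("Born on 1970/01/01.")]
print(matches)  # [('1970/01/01', 'date')]
```

Patterns such as `\d{1,3}` contain a comma themselves, so a delimiter that never occurs in the regex (the cells below use `~`) avoids ambiguity.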

**🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `CHUNK`

**🔎 Parameters**

- `strategy`: One of `MATCH_FIRST`, `MATCH_ALL`, or `MATCH_COMPLETE` (default: `MATCH_ALL`).
- `rules`: Regex rules, each pairing a pattern with an identifier.
- `delimiter`: Delimiter separating the pattern from the identifier in rules provided with `setRules`.
- `externalRules`: External resource containing the rules; requires a `delimiter` in its options.
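The three strategies can be approximated with plain Python `re` calls. This is only an analogy, assuming the strategy names carry the same meaning as in the open-source Spark NLP `RegexMatcher` (all matches, first match only, match only if the whole input matches); it does not run the annotator.

```python
import re

text = "2093-01-13 and 2093-02-14"
date = re.compile(r"\d{4}-\d{2}-\d{2}")

# MATCH_ALL (assumed meaning): every non-overlapping match in the input.
match_all = [m.group(0) for m in date.finditer(text)]

# MATCH_FIRST (assumed meaning): only the first match.
first = date.search(text)
match_first = [first.group(0)] if first else []

# MATCH_COMPLETE (assumed meaning): a result only when the whole input matches.
complete = date.fullmatch(text)
match_complete = [complete.group(0)] if complete else []

print(match_all)       # ['2093-01-13', '2093-02-14']
print(match_first)     # ['2093-01-13']
print(match_complete)  # []
```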

### IP and DATE

In [4]:
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

data = spark.createDataFrame([[text]]).toDF("text")

In [5]:
!mkdir -p rules

# Raw string keeps the regex backslashes intact; format is <regex>~<entity>
rules = r'''
(\d{1,3}\.){3}\d{1,3}~IPADDR
\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}~DATE
'''

with open('./rules/regex_rules.txt', 'w') as f:
    f.write(rules)

In [6]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

regex_matcher_internal = RegexMatcherInternal()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./rules/regex_rules.txt', delimiter='~')

nlpPipeline = Pipeline(
    stages=[
        document_assembler,
        regex_matcher_internal
])

result = nlpPipeline.fit(data).transform(data)

In [7]:
result.select('regex_matches.result','regex_matches.metadata').show(truncate=70)

+--------------------------------------+----------------------------------------------------------------------+
|                                result|                                                              metadata|
+--------------------------------------+----------------------------------------------------------------------+
|[2093-01-13, 203.120.223.13, 01/13/93]|[{entity -> DATE, chunk -> 0, sentence -> 0}, {entity -> IPADDR, ch...|
+--------------------------------------+----------------------------------------------------------------------+



In [8]:
result_df = result.select(F.explode(F.arrays_zip(result.regex_matches.result,
                                                 result.regex_matches.begin,
                                                 result.regex_matches.end,
                                                 result.regex_matches.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("regex_result"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"))
result_df.show()

+--------------+-----+---+---------+
|  regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
|    2093-01-13|   38| 47|     DATE|
|203.120.223.13|   97|110|   IPADDR|
|      01/13/93|  188|195|     DATE|
+--------------+-----+---+---------+
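The `begin` and `end` columns are character offsets into the input text, with an inclusive `end`. The sketch below reproduces the same rows with Python's `re` module instead of Spark NLP, purely as a way to check the offsets; the `rules` dictionary is a stand-in for the rules file.

```python
import re

text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

# Stand-in for rules/regex_rules.txt: entity label -> pattern
rules = {
    "IPADDR": r"(\d{1,3}\.){3}\d{1,3}",
    "DATE": r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}",
}

rows = []
for label, pattern in rules.items():
    for m in re.finditer(pattern, text):
        # m.end() is exclusive in Python; subtract 1 for an inclusive end offset.
        rows.append((m.group(0), m.start(), m.end() - 1, label))

rows.sort(key=lambda r: r[1])
for row in rows:
    print(row)
# ('2093-01-13', 38, 47, 'DATE')
# ('203.120.223.13', 97, 110, 'IPADDR')
# ('01/13/93', 188, 195, 'DATE')
```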



### AGE

- `(?i)(?<=((age of|age)))(\d{1,3})~AGE`: matches a number preceded by the prefix `age of` or `age`.
- `(\d{1,3})(?i)(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))~AGE`: matches a number followed by the suffix `-years-old`, `years-old`, or `-year-old`.
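The suffix rule relies on a lookahead: the number is captured only if one of the suffixes follows it, and the suffix itself is not part of the match. A minimal sketch with Python's `re` module is shown below. The prefix rule is left out because Python's `re` only accepts fixed-width lookbehind, and the inline `(?i)` is replaced by `re.IGNORECASE` because Python does not allow a global flag in the middle of a pattern.

```python
import re

# Suffix rule from the list above, adapted to Python's regex dialect.
age_suffix = re.compile(
    r"(\d{1,3})(?=[^a-z0-9]{0,3}(-years-old|years-old|-year-old))",
    re.IGNORECASE,
)

sentence = "He is a 60-year-old male admitted on 01/13/93"
found = [(m.group(1), m.start(), m.end() - 1) for m in age_suffix.finditer(sentence)]
print(found)  # [('60', 8, 9)]
```

The date at the end of the sentence is not returned because no suffix follows its digits.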


In [9]:
regex_rules_age = r"""
(?i)(?<=((age of|age)))(\d{1,3})~AGE
(\d{1,3})(?i)(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))~AGE
"""
with open('./rules/age_rules.txt', 'w') as f:
    f.write(regex_rules_age)

In [10]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

regex_matcher_internal = RegexMatcherInternal()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules('./rules/age_rules.txt', delimiter='~')


nlpPipeline = Pipeline(
    stages=[
        document_assembler,
        regex_matcher_internal
])

result = nlpPipeline.fit(data).transform(data)

In [11]:
result.select('regex_matches.result','regex_matches.metadata').show(truncate=70)


+------+--------------------------------------------+
|result|                                    metadata|
+------+--------------------------------------------+
|  [60]|[{entity -> AGE, chunk -> 0, sentence -> 0}]|
+------+--------------------------------------------+



In [12]:
# .show() returns None, so the result is displayed without being reassigned
result.select(F.explode(F.arrays_zip(result.regex_matches.result,
                                     result.regex_matches.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("regex_result"),
              F.expr("cols['1']['entity']").alias("ner_label")).show()

+------------+---------+
|regex_result|ner_label|
+------------+---------+
|          60|      AGE|
+------------+---------+



## Create Custom Regex Rules

In [13]:
def write_rule(file_path, rules):
    """Write one rule per line to `file_path` in `<regex>~<label>` format.

    Each base rule is wrapped in a lookbehind (prefix keywords) and/or followed
    by a lookahead (suffix keywords). Besides the immediately adjacent case, up to
    `contextLength - 1` non-alphanumeric characters may separate keyword and match.
    """
    prefix_rule = "(?i)(?<=((({})[^a-z0-9]{})))"
    prefix_rule_init = "(?i)(?<=(({})))"
    suffix_rule = "(?i)(?=(([^a-z0-9]{}({}))))"
    suffix_rule_init = "(?i)(?=(({})))"

    with open(file_path, 'w') as f:
        for label, spec in rules.items():
            base = spec['rule']
            context_length = spec['contextLength']
            # 'prefix' and 'suffix' are optional; a missing key means no keywords
            prefixes = "|".join(spec.get('prefix', []))
            suffixes = "|".join(spec.get('suffix', []))

            if prefixes:
                # Keyword directly before the match, then with 1..n-1 separators
                f.write(prefix_rule_init.format(prefixes) + base + f"~{label}\n")
                for i in range(1, context_length):
                    f.write(prefix_rule.format(prefixes, '{' + str(i) + '}') + base + f"~{label}\n")

            if suffixes:
                # The lookahead must follow the base rule, otherwise it can never match
                f.write(base + suffix_rule_init.format(suffixes) + f"~{label}\n")
                for i in range(1, context_length):
                    f.write(base + suffix_rule.format('{' + str(i) + '}', suffixes) + f"~{label}\n")
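To see what `write_rule` produces for a prefix-only specification, the sketch below builds the same lines in memory and tries them on a sample string with Python's `re` module. `build_prefix_rules` is a hypothetical helper that mirrors the prefix branch only. The two prefixes have the same length on purpose: Python's `re` requires fixed-width lookbehind, whereas the JVM regex engine accepts alternatives of different (bounded) lengths.

```python
import re

def build_prefix_rules(label, rule, prefixes, context_length):
    """Return the lines write_rule emits for a prefix-only spec (hypothetical helper)."""
    joined = "|".join(prefixes)
    lines = ["(?i)(?<=(({})))".format(joined) + rule + f"~{label}"]
    for i in range(1, context_length):
        gap = "{" + str(i) + "}"
        lines.append("(?i)(?<=((({})[^a-z0-9]{})))".format(joined, gap) + rule + f"~{label}")
    return lines

lines = build_prefix_rules("SSN", r"(\d{3}-\d{2}-\d{4})", ["ssn", "ss#"], 3)
for line in lines:
    print(line)
# (?i)(?<=((ssn|ss#)))(\d{3}-\d{2}-\d{4})~SSN
# (?i)(?<=(((ssn|ss#)[^a-z0-9]{1})))(\d{3}-\d{2}-\d{4})~SSN
# (?i)(?<=(((ssn|ss#)[^a-z0-9]{2})))(\d{3}-\d{2}-\d{4})~SSN

# Drop the '~SSN' part and test each pattern on a sample string.
sample = "SSN #333-44-6666"
hits = [bool(re.search(line.rsplit("~", 1)[0], sample)) for line in lines]
print(hits)  # [False, False, True]
```

Only the last rule matches, because two non-alphanumeric characters (a space and `#`) sit between the keyword and the number. This is why a `contextLength` of at least 3 is needed for text such as `SSN #333-44-6666`.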

In [14]:
!mkdir -p rules regex_models

### SSN

In [15]:
rule_path = "rules/ssn_regex_rule.txt"
model_path = "regex_models/ssn_regex_parser_model"
context_length = 3

regex_rules_ssn = {
    'SSN' :
        {
            'rule' : r'(\d{3}-\d{2}-\d{4})',
            'prefix' : [
                "social", "security", "ss#", "ssn#","ssid", "ss #", "ssn #", "SSA Number", "social security number",
                "social security #", "social security#", "social security no","Soc Sec", "SSN", "SSNS", "SSN#", "SS#", "SSID"
                ],
            'label' : 'SSN',
            'contextLength' : context_length
        }
}

write_rule(rule_path, regex_rules_ssn)

ssn_regex_matcher = RegexMatcherInternal()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("ssn_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = Pipeline(
    stages=[
      document_assembler,
      ssn_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

### AGE

In [16]:
rule_path = "rules/age_regex_rule.txt"
model_path = "regex_models/age_regex_parser_model"

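# One rule per line, without leading whitespace; raw strings keep the backslashes intact
age_rules = [
    r"(?i)(?<=((age of|age)))(\d{1,3})~AGE",
    r"(\d{1,3})(?i)(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))~AGE",
]

with open(rule_path, 'w') as f:
    f.write("\n".join(age_rules))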


age_regex_matcher = RegexMatcherInternal()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("age_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = Pipeline(
    stages=[
      document_assembler,
      age_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

### MAIL

In [17]:
rule_path = "rules/mail_regex_rule.txt"
model_path = "regex_models/mail_regex_parser_model"

with open(rule_path, 'w') as f:
    f.write(r"""[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}~EMAIL""")


mail_regex_matcher = RegexMatcherInternal()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("mail_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = Pipeline(
    stages=[
      document_assembler,
      mail_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

### PHONE

In [18]:
rule_path = "rules/phone_regex_rule.txt"
model_path = "regex_models/phone_regex_parser_model"

with open(rule_path, 'w') as f:
    f.write(r"""\(\d{3}\) \d{3}-\d{4}~PHONE""")


phone_regex_matcher = RegexMatcherInternal()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("phone_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = Pipeline(
    stages=[
      document_assembler,
      phone_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

## RegexMatcherModel

In [19]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

regex_matcher_ssn = RegexMatcherModel.load("regex_models/ssn_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("ssn_matched_text")

regex_matcher_age = RegexMatcherModel.load("regex_models/age_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("age_matched_text")

regex_matcher_mail = RegexMatcherModel.load("regex_models/mail_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("mail_matched_text")

regex_matcher_phone = RegexMatcherModel.load("regex_models/phone_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("phone_matched_text")

chunk_merge = ChunkMergeApproach()\
      .setInputCols("ssn_matched_text",
                    "age_matched_text",
                    "mail_matched_text",
                    "phone_matched_text")\
      .setOutputCol("ner_chunk")\
      .setMergeOverlapping(True)\
      .setChunkPrecedence("field")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]


In [20]:
nlpPipeline = Pipeline(stages=[
      document_assembler,
      sentence_detector,
      tokenizer,
      regex_matcher_ssn,
      regex_matcher_age,
      regex_matcher_mail,
      regex_matcher_phone,
      chunk_merge
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_pipeline_model = nlpPipeline.fit(empty_data)
light_model = LightPipeline(regex_pipeline_model)

## Using LightPipeline

In [21]:
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

result = light_model.fullAnnotate(text)

In [22]:
result[0].keys()

dict_keys(['document', 'ner_chunk', 'phone_matched_text', 'age_matched_text', 'ssn_matched_text', 'token', 'sentence', 'mail_matched_text'])

In [23]:
result[0]['ner_chunk']

[Annotation(chunk, 121, 122, 60, {'entity': 'AGE', 'chunk': '0', 'sentence': '2'}, []),
 Annotation(chunk, 239, 249, 333-44-6666, {'entity': 'SSN', 'chunk': '1', 'sentence': '3'}, []),
 Annotation(chunk, 289, 302, (302) 786-5227, {'entity': 'PHONE', 'chunk': '2', 'sentence': '4'}, []),
 Annotation(chunk, 347, 361, smith@gmail.com, {'entity': 'EMAIL', 'chunk': '3', 'sentence': '4'}, [])]
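The merged `ner_chunk` output lists the chunks from all four matchers in offset order. The sketch below imitates that with plain Python tuples. It is not the `ChunkMergeApproach` implementation: `merge_chunks` is a hypothetical helper, its overlap rule is an assumption of this sketch, and it does not model `setChunkPrecedence`.

```python
def merge_chunks(*chunk_lists):
    """Merge (begin, end, text, label) tuples coming from several matchers.

    Assumption for this sketch: chunks are ordered by begin offset, and when two
    chunks overlap, the one that starts first (ties: the longer one) is kept.
    """
    merged = []
    all_chunks = (c for chunks in chunk_lists for c in chunks)
    for chunk in sorted(all_chunks, key=lambda c: (c[0], -(c[1] - c[0]))):
        if merged and chunk[0] <= merged[-1][1]:
            continue  # overlaps the previously kept chunk (end offsets are inclusive)
        merged.append(chunk)
    return merged

# Offsets copied from the annotations above
ssn = [(239, 249, "333-44-6666", "SSN")]
age = [(121, 122, "60", "AGE")]
mail = [(347, 361, "smith@gmail.com", "EMAIL")]
phone = [(289, 302, "(302) 786-5227", "PHONE")]

merged = merge_chunks(ssn, age, mail, phone)
for chunk in merged:
    print(chunk)
```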

In [24]:
ner_chunk = []
ner_label = []
begin = []
end = []

for n in result[0]['ner_chunk']:

    begin.append(n.begin)
    end.append(n.end)
    ner_chunk.append(n.result)
    ner_label.append(n.metadata['entity'])


import pandas as pd

df = pd.DataFrame({'ner_chunk':ner_chunk, 'begin': begin, 'end':end,
                   'ner_label':ner_label})

df

|index|ner_chunk|begin|end|ner_label|
|----:|:----|----:|----:|:----|
|0|60|121|122|AGE|
|1|333-44-6666|239|249|SSN|
|2|(302) 786-5227|289|302|PHONE|
|3|smith@gmail.com|347|361|EMAIL|


## Transform

In [25]:
# This DataFrame holds the sample text, so it is not named `empty_data`
data = spark.createDataFrame([[text]]).toDF("text")

result = regex_pipeline_model.transform(data)
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|    ssn_matched_text|    age_matched_text|   mail_matched_text|  phone_matched_text|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Name : Hendrickso...|[{document, 0, 36...|[{document, 0, 60...|[{token, 0, 3, Na...|[{chunk, 239, 249...|[{chunk, 121, 122...|[{chunk, 347, 361...|[{chunk, 289, 302...|[{chunk, 121, 122...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [26]:
# .show() returns None, so the result is displayed without being reassigned
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show()

+---------------+---------+
|      ner_chunk|ner_label|
+---------------+---------+
|             60|      AGE|
|    333-44-6666|      SSN|
| (302) 786-5227|    PHONE|
|smith@gmail.com|    EMAIL|
+---------------+---------+



## Pretrained Pipeline

In [27]:
regex_pipeline_model.write().overwrite().save("regex_pipeline_model")

In [28]:
from sparknlp.pretrained import PretrainedPipeline

regex_pipeline_loaded = PretrainedPipeline.from_disk("regex_pipeline_model")

In [29]:
result = regex_pipeline_loaded.fullAnnotate(text)
result[0].keys()

dict_keys(['document', 'ner_chunk', 'phone_matched_text', 'age_matched_text', 'ssn_matched_text', 'token', 'sentence', 'mail_matched_text'])

In [30]:
ner_chunk = []
ner_label = []
begin = []
end = []

for n in result[0]['ner_chunk']:

    begin.append(n.begin)
    end.append(n.end)
    ner_chunk.append(n.result)
    ner_label.append(n.metadata['entity'])


import pandas as pd

df = pd.DataFrame({'ner_chunk':ner_chunk,
                   'begin': begin, 'end':end,
                   'ner_label':ner_label})

df

|index|ner_chunk|begin|end|ner_label|
|----:|:----|----:|----:|:----|
|0|60|121|122|AGE|
|1|333-44-6666|239|249|SSN|
|2|(302) 786-5227|289|302|PHONE|
|3|smith@gmail.com|347|361|EMAIL|


# 📜 TextMatcherInternal

In this notebook, we will examine the `TextMatcherInternal` annotator and its model version `TextMatcherInternalModel`.

This annotator matches exact phrases provided in a file against a document.


**📖 Learning Objectives:**

1. Understand how to match exact phrases using a predefined dictionary.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop Repository]()

Python Documentation: [TextMatcherInternal]()

Scala Documentation: [TextMatcherInternal]()


**🖨️ Input/Output Annotation Types**
- Input: ``DOCUMENT`` , ``TOKEN``    
- Output: ``CHUNK``

**🔎 Parameters**


- `setEntities` *(str)*: Sets the external resource for the entities.
    - `path` *(str)*: Path to the external resource.
    - `read_as` *(str, optional)*: How to read the resource, by default `ReadAs.TEXT`.
    - `options` *(dict, optional)*: Options for reading the resource, by default `{"format": "text"}`.

- `setCaseSensitive` *(Boolean)*: Sets whether matching is case sensitive. (Default: True)

- `setMergeOverlapping` *(Boolean)*: Sets whether to merge overlapping matched chunks. (Default: False)

- `setEntityValue` *(str)*: Sets the value of the entity metadata field. It is used for phrases that have no entity value set in the file.

- `setBuildFromTokens` *(Boolean)*: Sets whether `TextMatcherInternal` should build the `CHUNK` from `TOKEN` annotations.

- `setDelimiter` *(str)*: Sets the delimiter between the phrase and its entity.





## How to Use `TextMatcherInternal`

First, we create a source file that lists all the chunks or tokens we want to capture. In the example below, `#` separates each phrase from its entity label, so we set the delimiter with `setDelimiter("#")`.

In [31]:
matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
"""

with open ('matcher_drug.csv', 'w') as f:
  f.write(matcher_drug)
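Each non-empty line of the source file is a phrase and its label, separated by the delimiter. The sketch below parses that format in plain Python to make the structure explicit. `parse_entities` is a hypothetical helper and not part of Spark NLP; skipping blank lines is an assumption of this sketch.

```python
def parse_entities(source, delimiter="#"):
    """Map each phrase to its label, skipping blank lines (hypothetical helper)."""
    entries = {}
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        phrase, label = line.rsplit(delimiter, 1)
        entries[phrase] = label
    return entries

matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
"""

entries = parse_entities(matcher_drug)
print(entries)
# {'Aspirin 100mg': 'Drug', 'aspirin': 'Drug', 'paracetamol': 'Drug'}
```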

In [32]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityExtractor = TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv")\
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")\
    .setMergeOverlapping(False)

mathcer_pipeline = Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")

result = mathcer_pipeline.fit(data).transform(data)

In [33]:
result.select(F.explode(F.arrays_zip(
              result.matched_text.result,
              result.matched_text.begin,
              result.matched_text.end,
              result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|      aspirin|   25| 31| Drug|
|aspirin 100mg|   25| 37| Drug|
|  paracetamol|   75| 85| Drug|
|  amoxicillin|  102|112| Drug|
|    ibuprofen|  134|142| Drug|
| lansoprazole|  170|181| Drug|
+-------------+-----+---+-----+



As shown above, the `matcher_drug.csv` file contains two similar entries, `aspirin` and `Aspirin 100mg`, and the text matches both of them. To see both matches, set the `MergeOverlapping` parameter to `False`, as in the example below.

In [34]:
entityExtractor = TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv") \
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")\
    .setMergeOverlapping(False)

mathcer_pipeline = Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

result = mathcer_pipeline.fit(data).transform(data)

In [35]:
result.select(F.explode(F.arrays_zip(
              result.matched_text.result,
              result.matched_text.begin,
              result.matched_text.end,
              result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|      aspirin|   25| 31| Drug|
|aspirin 100mg|   25| 37| Drug|
|  paracetamol|   75| 85| Drug|
|  amoxicillin|  102|112| Drug|
|    ibuprofen|  134|142| Drug|
| lansoprazole|  170|181| Drug|
+-------------+-----+---+-----+
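The two `aspirin` rows in the table above overlap: both start at offset 25. One plausible reading of merging overlapping chunks is the classic interval merge sketched below, which collapses overlapping spans into their union. This is an illustration in plain Python under that assumption, not the annotator's implementation.

```python
def merge_overlapping(spans):
    """Collapse overlapping (begin, end) spans with inclusive ends into their union."""
    merged = []
    for begin, end in sorted(spans):
        if merged and begin <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged

text = "John's doctor prescribed aspirin 100mg for his heart condition"
spans = [(25, 31), (25, 37)]  # "aspirin" and "aspirin 100mg" from the table above

result_spans = merge_overlapping(spans)
for begin, end in result_spans:
    print(begin, end, text[begin:end + 1])  # 25 37 aspirin 100mg
```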



When the `CaseSensitive` parameter is set to `True`, phrases must match the case used in the source file. Phrases whose case differs are not returned: below, `aspirin 100mg` is no longer matched because the file lists it as `Aspirin 100mg`.

In [36]:
entityExtractor = TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv") \
    .setOutputCol("matched_text")\
    .setCaseSensitive(True)\
    .setDelimiter("#")\
    .setMergeOverlapping(False)

mathcer_pipeline = Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

matcher_model = mathcer_pipeline.fit(data)
result = matcher_model.transform(data)

In [37]:
result.select(F.explode(F.arrays_zip(
              result.matched_text.result,
              result.matched_text.begin,
              result.matched_text.end,
              result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+------------+-----+---+-----+
|       chunk|begin|end|label|
+------------+-----+---+-----+
|     aspirin|   25| 31| Drug|
| paracetamol|   75| 85| Drug|
| amoxicillin|  102|112| Drug|
|   ibuprofen|  134|142| Drug|
|lansoprazole|  170|181| Drug|
+------------+-----+---+-----+
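The effect of case sensitivity can be reproduced with a simple substring search in plain Python. This sketch assumes nothing about how the annotator matches internally (it works on tokens, while this code searches raw text), and `find_phrases` is a hypothetical helper.

```python
import re

text = ("John's doctor prescribed aspirin 100mg for his heart condition, "
        "along with paracetamol for his fever")
phrases = ["Aspirin 100mg", "aspirin", "paracetamol"]

def find_phrases(text, phrases, case_sensitive):
    """Return (match, begin, inclusive end) for every phrase occurrence."""
    flags = 0 if case_sensitive else re.IGNORECASE
    hits = []
    for phrase in phrases:
        for m in re.finditer(re.escape(phrase), text, flags):
            hits.append((m.group(0), m.start(), m.end() - 1))
    return sorted(hits, key=lambda h: (h[1], h[2]))

insensitive = find_phrases(text, phrases, case_sensitive=False)
sensitive = find_phrases(text, phrases, case_sensitive=True)

print(insensitive)
# [('aspirin', 25, 31), ('aspirin 100mg', 25, 37), ('paracetamol', 75, 85)]
print(sensitive)
# [('aspirin', 25, 31), ('paracetamol', 75, 85)]
```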



## Multiple Entities

We can set multiple entities in the same file.

In [38]:
multiple_entites= """
Aspirin 100mg#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
fever#Symptom
headache#Symptom
tonsilitis#Disease
GORD#Disease
heart condition#Disease
"""

with open ('multiple_entities.csv', 'w') as f:
  f.write(multiple_entites)

In [39]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

entityExtractor = TextMatcherInternal() \
    .setInputCols(["document", "token"]) \
    .setEntities("multiple_entities.csv") \
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")

mathcer_pipeline = Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")

matcher_model = mathcer_pipeline.fit(data)
result = matcher_model.transform(data)

In [40]:
result.select(F.explode(F.arrays_zip(
              result.matched_text.result,
              result.matched_text.begin,
              result.matched_text.end,
              result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|heart condition|   47| 61|Disease|
|  aspirin 100mg|   25| 37|   Drug|
|     tonsilitis|  135|144|Disease|
|    paracetamol|   75| 85|   Drug|
|           GORD|  204|207|Disease|
|    amoxicillin|  115|125|   Drug|
|      ibuprofen|  147|155|   Drug|
|   lansoprazole|  183|194|   Drug|
|          fever|   95| 99|Symptom|
|       headache|  105|112|Symptom|
+---------------+-----+---+-------+



## `TextMatcherInternalModel`

This annotator is the fitted model of `TextMatcherInternal`. Once you have fitted a `TextMatcherInternal()`, you can save it and load it later with `TextMatcherInternalModel.load()`.

Let's rebuild one of the earlier examples and save it.

In [41]:
entityExtractor = TextMatcherInternal() \
    .setInputCols(["document", "token"]) \
    .setEntities("multiple_entities.csv") \
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")

mathcer_pipeline = Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")

# Keep the fitted pipeline so that its last stage can be saved in the next cell
matcher_model = mathcer_pipeline.fit(data)
result = matcher_model.transform(data)

Saving the fitted model to disk

In [42]:
matcher_model.stages[-1].write().overwrite().save("mathcer_internal_model")

Loading the saved model and using it with the `TextMatcherInternalModel()` via `load`.

In [43]:
entity_ruler = TextMatcherInternalModel.load('/content/mathcer_internal_model') \
    .setInputCols(["document", "token"]) \
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")

pipeline = Pipeline(stages=[documentAssembler,
                            tokenizer,
                            entity_ruler])

pipeline_model = pipeline.fit(data)

Checking the result

In [44]:
result = pipeline_model.transform(data)

result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                    result.matched_text.begin,
                                    result.matched_text.end,
                                    result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|  aspirin 100mg|   25| 37|   Drug|
|    paracetamol|   75| 85|   Drug|
|    amoxicillin|  115|125|   Drug|
|      ibuprofen|  147|155|   Drug|
|   lansoprazole|  183|194|   Drug|
|          fever|   95| 99|Symptom|
|heart condition|   47| 61|Disease|
|     tonsilitis|  135|144|Disease|
|       headache|  105|112|Symptom|
|           GORD|  204|207|Disease|
+---------------+-----+---+-------+



As seen above, we built a `TextMatcherInternal`, saved it, and used the saved model with `TextMatcherInternalModel`.

## Using LightPipeline

The TextMatcherInternal annotator can also be applied by using LightPipeline:

In [45]:
light_pipeline = LightPipeline(pipeline_model)

In [46]:
annotations = light_pipeline.fullAnnotate("John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD.")[0]
annotations.keys()

dict_keys(['document', 'token', 'matched_text'])

In [47]:
annotations.get('matched_text')

[Annotation(chunk, 25, 37, aspirin 100mg, {'entity': 'Drug', 'sentence': '0', 'chunk': '0'}, []),
 Annotation(chunk, 75, 85, paracetamol, {'entity': 'Drug', 'sentence': '0', 'chunk': '1'}, []),
 Annotation(chunk, 115, 125, amoxicillin, {'entity': 'Drug', 'sentence': '0', 'chunk': '2'}, []),
 Annotation(chunk, 147, 155, ibuprofen, {'entity': 'Drug', 'sentence': '0', 'chunk': '3'}, []),
 Annotation(chunk, 183, 194, lansoprazole, {'entity': 'Drug', 'sentence': '0', 'chunk': '4'}, []),
 Annotation(chunk, 47, 61, heart condition, {'entity': 'Disease', 'sentence': '0', 'chunk': '5'}, []),
 Annotation(chunk, 135, 144, tonsilitis, {'entity': 'Disease', 'sentence': '0', 'chunk': '6'}, []),
 Annotation(chunk, 204, 207, GORD, {'entity': 'Disease', 'sentence': '0', 'chunk': '7'}, []),
 Annotation(chunk, 95, 99, fever, {'entity': 'Symptom', 'sentence': '0', 'chunk': '8'}, []),
 Annotation(chunk, 105, 112, headache, {'entity': 'Symptom', 'sentence': '0', 'chunk': '9'}, [])]

Display the result with `spark-nlp-display`.

In [48]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(annotations, label_col='matched_text')

## Pretrained Models

<center>

  <b> Text Matcher Pretrained Models</b>



|index|model|entities|
|----:|:----|-------|
| 1| [drug_matcher](https://nlp.johnsnowlabs.com/2024/02/21/drug_matcher_en.html)  |`DRUG` |
| 2| [biomarker_matcher](https://nlp.johnsnowlabs.com/2024/02/28/biomarker_matcher_en.html)  |`Biomarker` |

</center>

In [49]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

text_matcher = TextMatcherInternalModel.pretrained("drug_matcher","en","clinical/models") \
    .setInputCols(["document", "token"])\
    .setOutputCol("matched_text")

matcher_pipeline = Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  text_matcher])

text = """John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, ciprofloxacin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01."""

data = spark.createDataFrame([[text]]).toDF("text")

result = matcher_pipeline.fit(data).transform(data)

drug_matcher download started this may take some time.
[OK!]


In [50]:
result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                      result.matched_text.begin,
                                      result.matched_text.end,
                                      result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|      aspirin|   25| 31| DRUG|
|  paracetamol|   69| 79| DRUG|
|ciprofloxacin|  109|121| DRUG|
|    ibuprofen|  143|151| DRUG|
| lansoprazole|  179|190| DRUG|
+-------------+-----+---+-----+



# 📜 EntityRulerInternal

This section covers the parameters and usage of **EntityRulerInternal**. Two annotators perform this task in Spark NLP: `EntityRulerInternalApproach` and `EntityRulerInternalModel`. <br/>

This annotator matches exact strings or regex patterns provided in a file against a document and assigns them a named entity. The definitions can contain any number of named entities.

**📖 Learning Objectives:**

1. Understand how to match exact strings or regex patterns by using a pre-defined dictionary.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop]()

Python Documentation: [EntityRulerInternal]()

Scala Documentation: [EntityRulerInternal]()


There are multiple ways and formats to set the extraction resource. It can be provided as a "JSON", "JSONL" or "CSV" file. The path to the file is passed to `setPatternsResource`. The file format is set in the "format" field of the `options` parameter map and, depending on the file type, additional parameters might need to be set.
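
As an illustration of the JSONL option, the sketch below writes one pattern definition per line using only the standard library. The file name `entities.jsonl` and the sample definitions are assumptions made for this example. The commented-out call shows how such a file would presumably be passed to the annotator, with the "format" option set to "JSONL".

```python
import json

# Hypothetical pattern definitions, one JSON object per entity.
definitions = [
    {"id": "drug-words", "label": "Drug", "patterns": ["paracetamol", "aspirin"]},
    {"id": "symptom-words", "label": "Symptom", "patterns": ["fever", "headache"]},
]

# JSONL: each line is a complete JSON object (no enclosing list, no separating commas).
with open("entities.jsonl", "w") as f:
    for definition in definitions:
        f.write(json.dumps(definition) + "\n")

# Read the file back to confirm that every line parses on its own.
with open("entities.jsonl") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)

# Assumed usage (not executed here, it needs a licensed Spark NLP session):
# EntityRulerInternalApproach().setPatternsResource("entities.jsonl", options={"format": "JSONL"})
```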

**🖨️ Input/Output Annotation Types**
- Input: ``DOCUMENT`` , ``TOKEN``    
- Output: ``CHUNK``

**🔎 Parameters**


- `setPatternsResource` *(str)*: Sets the resource, in JSON or CSV format, that maps entities to patterns.
    - `path` *(str)*: Path to the resource.
    - `read_as` *(str, optional)*: How to interpret the resource, by default `ReadAs.TEXT`.
    - `options` *(dict, optional)*: Options for parsing the resource, by default `{"format": "JSON"}`.

- `setSentenceMatch` *(Boolean)*: Whether to find matches at sentence level. True: sentence level. False: token level.

- `setAlphabetResource` *(str)*: Alphabet resource (a plain text file with all language characters).

- `setUseStorage` *(Boolean)*:  Sets whether to use RocksDB storage to serialize patterns.





## Keywords Patterns

EntityRulerInternal produces chunks based on the patterns defined, as shown in the example below. An `id` field can be added to each definition to identify which rule produced an entity.

In [51]:
import json

data = [

    {
        "id": "drug-words",
        "label": "Drug",
        "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
    },
    {
        "id": "disease-words",
        "label": "Disease",
        "patterns": ["heart condition","tonsilitis","GORD"]
    },
    {
        "id": "symptom-words",
        "label": "Symptom",
        "patterns": ["fever","headache"]
    },

]

with open("entities.json", "w") as f:
    json.dump(data, f)

In [52]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityRuler = EntityRulerInternalApproach()\
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("entities.json")\
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityRuler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")


result = pipeline.fit(data).transform(data)

Checking the results:

In [53]:
result.select(F.explode(F.arrays_zip(result.entities.result,
                                      result.entities.begin,
                                      result.entities.end,
                                      result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+
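
The `begin` and `end` columns are character offsets into the input text, and `end` is inclusive. The sketch below reproduces three of the rows with plain Python string search. It is only meant to show how the offsets should be read, not how the annotator matches internally, and `find_span` is a hypothetical helper written for this illustration.

```python
text = ("John's doctor prescribed aspirin for his heart condition, along with "
        "paracetamol for his fever and headache, amoxicillin for his tonsilitis, "
        "ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.")


def find_span(text, keyword):
    """Return (begin, end) with an inclusive end offset, or None if the keyword is absent."""
    begin = text.lower().find(keyword.lower())  # case-insensitive search
    if begin == -1:
        return None
    return begin, begin + len(keyword) - 1


for keyword in ["aspirin", "heart condition", "lansoprazol"]:
    begin, end = find_span(text, keyword)
    # Slicing needs end + 1 because Python slices exclude the upper bound.
    print(keyword, begin, end, repr(text[begin:end + 1]))
```

Note that the pattern `lansoprazol` covers only the first eleven characters of `lansoprazole`, which is why the chunk in the table is missing the final `e`.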



For a CSV file, each line holds a label and a pattern separated by a delimiter. The format and the delimiter are passed through `options`:


In [54]:
with open('./entities.csv', 'w') as csvfile:
    csvfile.write('SYMPTOM|fever\n')
    csvfile.write('SYMPTOM|headache\n')
    csvfile.write('DRUG|paracetamol\n')
    csvfile.write('DRUG|aspirin\n')
    csvfile.write('DRUG|lansoprazol\n')
    csvfile.write('DRUG|ibuprofen\n')
    csvfile.write('DISEASE|tonsilitis\n')
    csvfile.write('DISEASE|GORD\n')
    csvfile.write('DISEASE|heart condition')

In [55]:
! cat ./entities.csv

SYMPTOM|fever
SYMPTOM|headache
DRUG|paracetamol
DRUG|aspirin
DRUG|lansoprazol
DRUG|ibuprofen
DISEASE|tonsilitis
DISEASE|GORD
DISEASE|heart condition

In [56]:
entity_ruler_csv = EntityRulerInternalApproach() \
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("./entities.csv", options={"format": "csv", "delimiter": "\\|"})
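
The delimiter above is written as an escaped pipe rather than a bare `|`. A plausible reason, stated here as an assumption rather than documented behaviour, is that the delimiter is interpreted as a regular expression, where an unescaped `|` means alternation. The plain-Python sketch below shows the difference on one line of the CSV file.

```python
import re

line = "DISEASE|heart condition"

# Escaped pipe: the regex matches the literal "|" character.
label, pattern = re.split(r"\|", line)
print(label, "->", pattern)

# Unescaped pipe: as a regex, "|" is an alternation of two empty patterns,
# so it cannot match the literal "|" character.
unescaped_matches_pipe = re.fullmatch("|", "|") is not None
escaped_matches_pipe = re.fullmatch(r"\|", "|") is not None
print(unescaped_matches_pipe, escaped_matches_pipe)
```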

In [57]:
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entity_ruler_csv
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

result = pipeline.fit(data).transform(data)

Checking the results:

In [58]:
result.select(F.explode(F.arrays_zip(result.entities.result,
                                      result.entities.begin,
                                      result.entities.end,
                                      result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   DRUG|
|heart condition|   41| 55|DISEASE|
|    paracetamol|   69| 79|   DRUG|
|          fever|   89| 93|SYMPTOM|
|       headache|   99|106|SYMPTOM|
|     tonsilitis|  129|138|DISEASE|
|      ibuprofen|  141|149|   DRUG|
|    lansoprazol|  177|187|   DRUG|
|           GORD|  198|201|DISEASE|
+---------------+-----+---+-------+



## Regex Patterns

As shown in the example below, we can define regex patterns to detect entities. Setting `"regex": True` in a definition marks its patterns as regular expressions.

In [59]:
import json

data = [
    {
        "id": "date-regex",
        "label": "Date",
        "patterns": ["\\d{4}-\\d{2}-\\d{2}","\\d{4}"],
        "regex": True
    },
    {
        "id": "drug-words",
        "label": "Drug",
        "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
    },
    {
        "id": "disease-words",
        "label": "Disease",
        "patterns": ["heart condition","tonsilitis","GORD"]
    },
    {
        "id": "symptom-words",
        "label": "Symptom",
        "patterns": ["fever","headache"]
    },
]

with open("entities.json", "w") as f:
    json.dump(data, f)

In [60]:
entityRuler = EntityRulerInternalApproach()\
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("entities.json")\
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityRuler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

Checking the results:

In [61]:
result.select(F.explode(F.arrays_zip(result.entities.result,
                                      result.entities.begin,
                                      result.entities.end,
                                      result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|     2023-12-01|  206|215|   Date|
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+
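
Only the full date appears in the table, even though `\d{4}` on its own would also fit the year. The sketch below applies the two patterns to the same sentence with Python's `re` module. It illustrates the patterns themselves, not the annotator's internal matching, and listing the longer pattern first in the alternation is a choice made for this example.

```python
import re

text = ("John's doctor prescribed aspirin for his heart condition, along with "
        "paracetamol for his fever and headache, amoxicillin for his tonsilitis, "
        "ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.")

# The two patterns from the "date-regex" rule, with the more specific one listed first.
date_regex = re.compile(r"\d{4}-\d{2}-\d{2}|\d{4}")

# m.end() is exclusive, so subtract 1 to get the inclusive end offset used in the table.
matches = [(m.group(), m.start(), m.end() - 1) for m in date_regex.finditer(text)]
print(matches)  # [('2023-12-01', 206, 215)]
```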



## `EntityRulerInternalModel`

This annotator is an instantiated model of the `EntityRulerInternalApproach`. Once you build an `EntityRulerInternalApproach()`, you can save it and use it with `EntityRulerInternalModel()` via the `load()` function. <br/>

Let's rebuild one of the examples that we have done before and save it.

In [62]:
data = spark.createDataFrame([["John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01."]]).toDF("text")
data.show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.|
+-----------------------------------------------------------------------------------------------------------------------

Fitting the pipeline and saving the resulting ruler model to disk:

In [63]:
empty_data = spark.createDataFrame([[""]]).toDF("text")

pipeline_model = pipeline.fit(empty_data)

pipeline_model.stages[-1].write().overwrite().save("ruler_approach_model")

Loading the saved model with `EntityRulerInternalModel.load()` and using it in a pipeline:

In [64]:
entity_ruler = EntityRulerInternalModel.load('/content/ruler_approach_model') \
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")

pipeline = Pipeline(stages=[documentAssembler,
                            tokenizer,
                            entity_ruler])

result = pipeline.fit(data).transform(data)

In [65]:
result.select(F.explode(F.arrays_zip(result.entities.result,
                                      result.entities.begin,
                                      result.entities.end,
                                      result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|     2023-12-01|  206|215|   Date|
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+



## Using LightPipeline

The EntityRulerInternal annotator can also be applied by using LightPipeline:

In [66]:
light_pipeline = LightPipeline(pipeline_model)

In [67]:
annotations = light_pipeline.fullAnnotate("John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.")[0]
annotations.keys()

dict_keys(['document', 'token', 'entities'])

In [68]:
annotations.get('entities')

[Annotation(chunk, 206, 215, 2023-12-01, {'entity': 'Date', 'id': 'date-regex', 'sentence': '0'}, []),
 Annotation(chunk, 25, 31, aspirin, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 41, 55, heart condition, {'entity': 'Disease', 'sentence': '0', 'id': 'disease-words'}, []),
 Annotation(chunk, 69, 79, paracetamol, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 89, 93, fever, {'entity': 'Symptom', 'sentence': '0', 'id': 'symptom-words'}, []),
 Annotation(chunk, 99, 106, headache, {'entity': 'Symptom', 'sentence': '0', 'id': 'symptom-words'}, []),
 Annotation(chunk, 129, 138, tonsilitis, {'entity': 'Disease', 'sentence': '0', 'id': 'disease-words'}, []),
 Annotation(chunk, 141, 149, ibuprofen, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 177, 187, lansoprazol, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 198, 201, GORD, {'entity': 'Disease', 'sent

Display the result with `spark-nlp-display`.

In [69]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(annotations, label_col='entities')