![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/01.6.Rule_Based_Entity_Matchers.ipynb)



## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical
# Automatically load license data and start a session with all jars user has access to

spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

import pandas as pd
import json
import string
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

#   **📜 RegexMatcher**

The **`RegexMatcher`** class implements an annotator approach that matches a set of regular expressions against text and labels each match with a provided entity. It is used to associate specific text patterns, such as dates, with predetermined entity labels.

The class allows users to define rules using regular expressions paired with entities, offering flexibility in customization. These rules can either be directly set using the `setRules` method, with a specified delimiter, or loaded from an external file using the `setExternalRules` method.

Additionally, users can specify parameters such as the matching strategy (`MATCH_FIRST`, `MATCH_ALL`, or `MATCH_COMPLETE`) to control how matches are handled. The output annotation type is `CHUNK`, with input annotation types supporting `DOCUMENT`. This class provides a versatile tool for implementing entity recognition based on user-defined patterns within text data.


A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be `"\\d{4}\\/\\d\\d\\/\\d\\d,date"` which will match strings like `"1970/01/01"` to the identifier `"date"`.
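To make the rule format concrete, here is a minimal pure-Python sketch that splits a rule into its pattern and identifier and applies it with the standard `re` module. It is an illustration only, not the Spark NLP implementation: the helper names `parse_rule` and `apply_rule` are hypothetical, and splitting on the last occurrence of the delimiter is an assumption made for this example.

```python
import re

def parse_rule(rule, delimiter=","):
    """Split 'pattern<delimiter>identifier' into its two parts (hypothetical helper)."""
    # Assumption: split on the LAST delimiter so the pattern itself may contain it.
    pattern, identifier = rule.rsplit(delimiter, 1)
    return pattern, identifier

def apply_rule(rule, text, delimiter=","):
    """Return (matched_text, identifier) pairs for every match of the rule in text."""
    pattern, identifier = parse_rule(rule, delimiter)
    return [(m.group(0), identifier) for m in re.finditer(pattern, text)]

rule = r"\d{4}\/\d\d\/\d\d,date"
matches = apply_rule(rule, "Born on 1970/01/01, admitted on 1993/01/13.")
print(matches)

# A quantifier such as {1,3} contains a comma, which is one reason to pick
# a less common delimiter (for example '~') for real rule files.
print(parse_rule(r"\d{1,3},num"))
```

Both dates in the sample sentence are returned with the identifier `date`.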

**🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `CHUNK`

**🔎 Parameters**

- `strategy`: One of `MATCH_FIRST`, `MATCH_ALL`, or `MATCH_COMPLETE` (default: `MATCH_ALL`).
- `rules`: Regex rules, each pairing a pattern with an identifier.
- `delimiter`: Delimiter that separates pattern and identifier in rules provided with `setRules`.
- `externalRules`: External resource (file) containing the rules; requires a `delimiter`.
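The three strategies can be approximated in plain Python for intuition. The sketch below uses the standard `re` module; the function `match_with_strategy` is hypothetical and not part of Spark NLP. In particular, treating `MATCH_COMPLETE` as "the pattern must match the entire input" is an assumption about the annotator's behavior.

```python
import re

def match_with_strategy(pattern, text, strategy="MATCH_ALL"):
    """Approximate the three strategies with Python's re (hypothetical helper)."""
    if strategy == "MATCH_ALL":
        # Every non-overlapping match in the text.
        return [m.group(0) for m in re.finditer(pattern, text)]
    if strategy == "MATCH_FIRST":
        # Only the first match, if any.
        m = re.search(pattern, text)
        return [m.group(0)] if m else []
    if strategy == "MATCH_COMPLETE":
        # Assumption: a result is returned only when the whole text matches.
        m = re.fullmatch(pattern, text)
        return [m.group(0)] if m else []
    raise ValueError(f"Unknown strategy: {strategy}")

date = r"\d{4}-\d{2}-\d{2}"
text = "Record date: 2093-01-13, follow-up: 2093-02-20"

print(match_with_strategy(date, text, "MATCH_ALL"))
print(match_with_strategy(date, text, "MATCH_FIRST"))
print(match_with_strategy(date, text, "MATCH_COMPLETE"))
print(match_with_strategy(date, "2093-01-13", "MATCH_COMPLETE"))
```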

### IP and DATE

In [None]:
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

data = spark.createDataFrame([[text]]).toDF("text")

In [None]:
!mkdir -p rules

rules = r'''
(\d{1,3}\.){3}\d{1,3}~IPADDR
\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}~DATE
'''

with open('./rules/regex_rules.txt', 'w') as f:
    f.write(rules)

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

regex_matcher = medical.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./rules/regex_rules.txt', delimiter='~')

nlpPipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        regex_matcher
])

result = nlpPipeline.fit(data).transform(data)

In [None]:
result.select('regex_matches.result','regex_matches.metadata').show(truncate=70)

+--------------------------------------+----------------------------------------------------------------------+
|                                result|                                                              metadata|
+--------------------------------------+----------------------------------------------------------------------+
|[2093-01-13, 203.120.223.13, 01/13/93]|[{entity -> DATE, ner_source -> regex_matches, chunk -> 0, sentence...|
+--------------------------------------+----------------------------------------------------------------------+



In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.regex_matches.result,
                                                 result.regex_matches.begin,
                                                 result.regex_matches.end,
                                                 result.regex_matches.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("regex_result"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"))
result_df.show()

+--------------+-----+---+---------+
|  regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
|    2093-01-13|   38| 47|     DATE|
|203.120.223.13|   97|110|   IPADDR|
|      01/13/93|  188|195|     DATE|
+--------------+-----+---+---------+



### AGE

- `(?i)(?<=((age of|age)))(\d{1,3})~AGE`: matches a number preceded by the prefix `age of` or `age`.
- `(\d{1,3})(?i)(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))~AGE`: matches a number followed, after at most three non-alphanumeric characters, by the suffix `-years-old`, `years-old`, or `-year-old`.
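The suffix rule relies on a lookahead: the digits are captured, while the `-year-old` context is only checked and not consumed. The sketch below reproduces that behavior with Python's `re` as a quick sanity check. Two adaptations are specific to Python and are not needed in the rule file: the inline `(?i)` flag is replaced by `re.IGNORECASE` (Python 3.11+ rejects global inline flags in the middle of a pattern), and the prefix rule is left out because Python's `re` only supports fixed-width lookbehind.

```python
import re

# Python adaptation of the suffix rule; the capturing group holds only the digits.
age_suffix = re.compile(
    r"(\d{1,3})(?=[^a-z0-9]{0,3}(?:-years-old|years-old|-year-old))",
    re.IGNORECASE,
)

text = "He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93"

match = age_suffix.search(text)
print(match.group(1), match.span(1))

# The date digits are not followed by an age suffix, so they are not matched.
print(age_suffix.findall(text))
```

Only `60` is returned; the lookahead keeps `-year-old` out of the matched chunk.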


In [None]:
regex_rules_age = r"""
(?i)(?<=((age of|age)))(\d{1,3})~AGE
(\d{1,3})(?i)(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))~AGE
"""
with open('./rules/age_rules.txt', 'w') as f:
    f.write(regex_rules_age)

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

regex_matcher = medical.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules('./rules/age_rules.txt', delimiter='~')


nlpPipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        regex_matcher
])

result = nlpPipeline.fit(data).transform(data)

In [None]:
result.select('regex_matches.result','regex_matches.metadata').show(truncate=70)


+------+----------------------------------------------------------------------+
|result|                                                              metadata|
+------+----------------------------------------------------------------------+
|  [60]|[{entity -> AGE, ner_source -> regex_matches, chunk -> 0, sentence ...|
+------+----------------------------------------------------------------------+



In [None]:
# show() returns None, so its result is not assigned back to `result`
result.select(F.explode(F.arrays_zip(result.regex_matches.result,
                                     result.regex_matches.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("regex_result"),
              F.expr("cols['1']['entity']").alias("ner_label")).show()

+------------+---------+
|regex_result|ner_label|
+------------+---------+
|          60|      AGE|
+------------+---------+



## Create Custom Regex Rules

In [None]:
def write_rule(file_path, rules):
    """Expand prefix/suffix context words into regex rules and write them to file_path."""

    prefix_rule = "(?i)(?<=((({})[^a-z0-9]{})))"
    prefix_rule_init = "(?i)(?<=(({})))"
    suffix_rule = "(?i)(?=(([^a-z0-9]{}({}))))"
    suffix_rule_init = "(?i)(?=(({})))"

    with open(file_path, 'w') as f:
        for label, spec in rules.items():
            pattern = spec['rule']
            context_length = spec.get('contextLength', 1)

            # Prefix rules: the context appears right before the pattern, optionally
            # separated by 1..contextLength-1 non-alphanumeric characters.
            prefixes = "|".join(spec.get('prefix', []))
            if prefixes:
                f.write(prefix_rule_init.format(prefixes) + pattern + f"~{label}\n")
                for i in range(1, context_length):
                    f.write(prefix_rule.format(prefixes, '{' + str(i) + '}') + pattern + f"~{label}\n")

            # Suffix rules: the lookahead must follow the pattern, so it is appended.
            suffixes = "|".join(spec.get('suffix', []))
            if suffixes:
                f.write(pattern + suffix_rule_init.format(suffixes) + f"~{label}\n")
                for i in range(1, context_length):
                    f.write(pattern + suffix_rule.format('{' + str(i) + '}', suffixes) + f"~{label}\n")
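To see what `write_rule` produces, the following sketch rebuilds only the prefix branch as a function that returns the rule strings instead of writing them to a file. The name `expand_prefix_rules` and the shortened two-item prefix list are hypothetical and chosen for illustration.

```python
def expand_prefix_rules(pattern, prefixes, label, context_length):
    """Rebuild the prefix rules emitted for one label (hypothetical helper)."""
    alternatives = "|".join(prefixes)
    # Rule 0: the prefix sits directly before the pattern.
    rules = [f"(?i)(?<=(({alternatives}))){pattern}~{label}"]
    # Rules 1..context_length-1: exactly `gap` non-alphanumeric characters in between.
    for gap in range(1, context_length):
        rules.append(
            f"(?i)(?<=((({alternatives})[^a-z0-9]{{{gap}}}))){pattern}~{label}"
        )
    return rules

ssn_rules = expand_prefix_rules(r"(\d{3}-\d{2}-\d{4})", ["ssn", "ss#"], "SSN", 3)
for rule in ssn_rules:
    print(rule)
```

With a context length of 3, three rules are generated. In this reduced example, the `{2}` variant is the one that would cover a string such as `SSN #333-44-6666`, where a space and `#` separate the prefix from the number.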

In [None]:
!mkdir -p rules regex_models

### SSN

In [None]:
rule_path = "rules/ssn_regex_rule.txt"
model_path = "regex_models/ssn_regex_parser_model"
context_length = 3

regex_rules_ssn = {
    'SSN' :
        {
            'rule' : r'(\d{3}-\d{2}-\d{4})',
            'prefix' : [
                "social", "security", "ss#", "ssn#","ssid", "ss #", "ssn #", "SSA Number", "social security number",
                "social security #", "social security#", "social security no","Soc Sec", "SSN", "SSNS", "SSN#", "SS#", "SSID"
                ],
            'label' : 'SSN',
            'contextLength' : context_length
        }
}

write_rule(rule_path, regex_rules_ssn)

ssn_regex_matcher = medical.RegexMatcher()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("ssn_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = nlp.Pipeline(
    stages=[
      document_assembler,
      ssn_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

### AGE

In [None]:
rule_path = "rules/age_regex_rule.txt"
model_path = "regex_models/age_regex_parser_model"

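with open(rule_path, 'w') as f:
    # One rule per line, without leading whitespace (it would become part of the regex)
    f.write(r"(?i)(?<=((age of|age)))(\d{1,3})~AGE" + "\n")
    f.write(r"(\d{1,3})(?i)(?=(([^a-z0-9]{0,3}(-years-old|years-old|-year-old))))~AGE")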


age_regex_matcher = medical.RegexMatcher()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("age_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = nlp.Pipeline(
    stages=[
      document_assembler,
      age_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

### MAIL

In [None]:
rule_path = "rules/mail_regex_rule.txt"
model_path = "regex_models/mail_regex_parser_model"

with open(rule_path, 'w') as f:
    f.write(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}~EMAIL")


mail_regex_matcher = medical.RegexMatcher()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("mail_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = nlp.Pipeline(
    stages=[
      document_assembler,
      mail_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

### PHONE

In [None]:
rule_path = "rules/phone_regex_rule.txt"
model_path = "regex_models/phone_regex_parser_model"

with open(rule_path, 'w') as f:
    f.write(r"\(\d{3}\) \d{3}-\d{4}~PHONE")


phone_regex_matcher = medical.RegexMatcher()\
    .setExternalRules(rule_path,  "~") \
    .setInputCols(["document"]) \
    .setOutputCol("phone_matched_text") \
    .setStrategy("MATCH_ALL")

regex_parser_pipeline = nlp.Pipeline(
    stages=[
      document_assembler,
      phone_regex_matcher
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_parser_model = regex_parser_pipeline.fit(empty_data)
regex_parser_model.stages[-1].write().overwrite().save(model_path)

## RegexMatcherModel

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

regex_matcher_ssn = medical.RegexMatcherModel.load("regex_models/ssn_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("ssn_matched_text")

regex_matcher_age = medical.RegexMatcherModel.load("regex_models/age_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("age_matched_text")

regex_matcher_mail = medical.RegexMatcherModel.load("regex_models/mail_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("mail_matched_text")

regex_matcher_phone = medical.RegexMatcherModel.load("regex_models/phone_regex_parser_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("phone_matched_text")

chunk_merge = medical.ChunkMergeApproach()\
      .setInputCols("ssn_matched_text",
                    "age_matched_text",
                    "mail_matched_text",
                    "phone_matched_text")\
      .setOutputCol("ner_chunk")\
      .setMergeOverlapping(True)\
      .setChunkPrecedence("field")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]


In [None]:
nlpPipeline = nlp.Pipeline(stages=[
      document_assembler,
      sentence_detector,
      tokenizer,
      regex_matcher_ssn,
      regex_matcher_age,
      regex_matcher_mail,
      regex_matcher_phone,
      chunk_merge
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

regex_pipeline_model = nlpPipeline.fit(empty_data)
light_model = nlp.LightPipeline(regex_pipeline_model)

## Using LightPipeline

In [None]:
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

result = light_model.fullAnnotate(text)

In [None]:
result[0].keys()

dict_keys(['document', 'ner_chunk', 'phone_matched_text', 'age_matched_text', 'ssn_matched_text', 'token', 'sentence', 'mail_matched_text'])

In [None]:
result[0]['ner_chunk']

[Annotation(chunk, 121, 122, 60, {'entity': 'AGE', 'ner_source': 'age_matched_text', 'chunk': '0', 'sentence': '2'}, []),
 Annotation(chunk, 239, 249, 333-44-6666, {'entity': 'SSN', 'ner_source': 'ssn_matched_text', 'chunk': '1', 'sentence': '3'}, []),
 Annotation(chunk, 289, 302, (302) 786-5227, {'entity': 'PHONE', 'ner_source': 'phone_matched_text', 'chunk': '2', 'sentence': '4'}, []),
 Annotation(chunk, 347, 361, smith@gmail.com, {'entity': 'EMAIL', 'ner_source': 'mail_matched_text', 'chunk': '3', 'sentence': '4'}, [])]

In [None]:
ner_chunk = []
ner_label = []
begin = []
end = []

for n in result[0]['ner_chunk']:

    begin.append(n.begin)
    end.append(n.end)
    ner_chunk.append(n.result)
    ner_label.append(n.metadata['entity'])


import pandas as pd

df = pd.DataFrame({'ner_chunk':ner_chunk, 'begin': begin, 'end':end,
                   'ner_label':ner_label})

df

| |ner_chunk|begin|end|ner_label|
|----:|:----|----:|----:|:----|
|0|60|121|122|AGE|
|1|333-44-6666|239|249|SSN|
|2|(302) 786-5227|289|302|PHONE|
|3|smith@gmail.com|347|361|EMAIL|


## Transform

In [None]:
text_data = spark.createDataFrame([[text]]).toDF("text")

result = regex_pipeline_model.transform(text_data)
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|    ssn_matched_text|    age_matched_text|   mail_matched_text|  phone_matched_text|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Name : Hendrickso...|[{document, 0, 36...|[{document, 0, 60...|[{token, 0, 3, Na...|[{chunk, 239, 249...|[{chunk, 121, 122...|[{chunk, 347, 361...|[{chunk, 289, 302...|[{chunk, 121, 122...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
# show() returns None, so its result is not assigned back to `result`
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show()

+---------------+---------+
|      ner_chunk|ner_label|
+---------------+---------+
|             60|      AGE|
|    333-44-6666|      SSN|
| (302) 786-5227|    PHONE|
|smith@gmail.com|    EMAIL|
+---------------+---------+



**Save and Load Pipeline**

In [None]:
regex_pipeline_model.write().overwrite().save("regex_pipeline_model")

In [None]:
regex_pipeline_loaded = nlp.PretrainedPipeline.from_disk("regex_pipeline_model")

In [None]:
result = regex_pipeline_loaded.fullAnnotate(text)
result[0].keys()

dict_keys(['document', 'ner_chunk', 'phone_matched_text', 'age_matched_text', 'ssn_matched_text', 'token', 'sentence', 'mail_matched_text'])

In [None]:
ner_chunk = []
ner_label = []
begin = []
end = []

for n in result[0]['ner_chunk']:

    begin.append(n.begin)
    end.append(n.end)
    ner_chunk.append(n.result)
    ner_label.append(n.metadata['entity'])


import pandas as pd

df = pd.DataFrame({'ner_chunk':ner_chunk,
                   'begin': begin, 'end':end,
                   'ner_label':ner_label})

df

| |ner_chunk|begin|end|ner_label|
|----:|:----|----:|----:|:----|
|0|60|121|122|AGE|
|1|333-44-6666|239|249|SSN|
|2|(302) 786-5227|289|302|PHONE|
|3|smith@gmail.com|347|361|EMAIL|


## Pretrained Models

<center>

  <b>Regex Matcher Pretrained Models</b>

|index|model|
|----:|:----|
| 1| [date_matcher](https://nlp.johnsnowlabs.com/2024/10/23/date_matcher_en.html)  |
| 2| [email_matcher](https://nlp.johnsnowlabs.com/2024/08/21/email_matcher_en.html)  |
| 3| [phone_matcher](https://nlp.johnsnowlabs.com/2024/06/19/phone_matcher_en.html) |
| 4| [state_matcher](https://nlp.johnsnowlabs.com/2024/06/19/state_matcher_en.html) |
| 5| [zip_matcher](https://nlp.johnsnowlabs.com/2024/06/19/zip_matcher_en.html) |
| 6| [url_matcher](https://nlp.johnsnowlabs.com/2024/08/21/url_matcher_en.html) |
| 7| [ip_matcher](https://nlp.johnsnowlabs.com/2024/08/21/ip_matcher_en.html) |

</center>

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

date_regex_matcher = medical.RegexMatcherModel.pretrained("date_matcher","en","clinical/models") \
    .setInputCols(["sentence"]) \
    .setOutputCol("date_parser")

parserPipeline = nlp.Pipeline(stages=[
        document_assembler,
        sentence_detector,
        date_regex_matcher
        ])

model = parserPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sample_text = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13.
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, VIN 4Y1SL65848Z411439, VIN 1HGCM82633A123456 - VIN JH4KA7560MC012345 - VIN 5YJSA1E14HF123456
Driver's license no: A334455B, plate 34NLP34. Lic: 12345As. Cert: 12345As
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com"""

result = model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
date_matcher download started this may take some time.
Approximate size to download 5 KB
[OK!]


In [None]:
result.select(F.explode(F.arrays_zip(result.date_parser.result,
                                      result.date_parser.begin,
                                      result.date_parser.end,
                                      result.date_parser.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+----------+-----+---+-----+
|     chunk|begin|end|label|
+----------+-----+---+-----+
|2093-01-13|   38| 47| DATE|
|  01/13/93|  187|194| DATE|
+----------+-----+---+-----+



# 📜 TextMatcher

In this notebook, we will examine the `TextMatcher` annotator and its model version `TextMatcherModel`.

This annotator matches exact phrases provided in a file against a document.


**📖 Learning Objectives:**

1. Understand how to match exact phrases using a predefined dictionary.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop Repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

Python Documentation: [TextMatcher](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/matcher/text_matcher/index.html#sparknlp.annotator.matcher.text_matcher.TextMatcher)

Scala Documentation: [TextMatcher](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/TextMatcher)


**🖨️ Input/Output Annotation Types**
- Input: ``DOCUMENT`` , ``TOKEN``    
- Output: ``CHUNK``

**🔎 Parameters**


- `setEntities` *(str)*: Sets the external resource for the entities.
  - `path` *(str)*: Path to the external resource.
  - `read_as` *(str, optional)*: How to read the resource (default: `ReadAs.TEXT`).
  - `options` *(dict, optional)*: Options for reading the resource (default: `{"format": "text"}`).

- `setCaseSensitive` *(Boolean)*: Sets whether matching is case sensitive (default: `True`).

- `setMergeOverlapping` *(Boolean)*: Sets whether to merge overlapping matched chunks (default: `False`).

- `setEntityValue` *(str)*: Sets the value for the entity metadata field. This value is used when the file does not specify an entity for a phrase.

- `setBuildFromTokens` *(Boolean)*: Sets whether the `TextMatcher` should build the `CHUNK` from `TOKEN` annotations.

- `setDelimiter` *(str)*: Sets the delimiter between the phrase and the entity in the file.





## How to Use `TextMatcher`

First, create a source file that lists all the chunks or tokens to capture. In the example below, `#` is the delimiter that separates each phrase from its entity label, so the delimiter must be set with `setDelimiter('#')`.
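As a rough illustration of what such a dictionary file encodes, the sketch below parses `phrase#label` lines and looks the phrases up in a sentence with the standard `re` module. It is not how `TextMatcher` is implemented: the helpers `load_entities` and `match_phrases` are hypothetical, matching is case-insensitive throughout, and whole-word matching with `\b` is an assumption that stands in for token-level matching.

```python
import re

def load_entities(lines, delimiter="#"):
    """Parse 'phrase<delimiter>label' lines into (phrase, label) pairs (hypothetical helper)."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        phrase, label = line.rsplit(delimiter, 1)
        entries.append((phrase, label))
    return entries

def match_phrases(text, entries):
    """Find exact phrase occurrences; begin and end are inclusive character offsets."""
    found = []
    for phrase, label in entries:
        # Assumption: \b word boundaries approximate token-level matching.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            found.append((m.group(0), m.start(), m.end() - 1, label))
    return found

entries = load_entities("Aspirin 100mg#Drug\naspirin#Drug\nparacetamol#Drug".splitlines())
text = "John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever."

rows = match_phrases(text, entries)
for row in rows:
    print(row)
```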

In [None]:
matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
"""

with open ('matcher_drug.csv', 'w') as f:
  f.write(matcher_drug)

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityExtractor = medical.TextMatcher()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv")\
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")\
    .setMergeOverlapping(False)

matcher_pipeline = nlp.Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")

result = matcher_pipeline.fit(data).transform(data)

In [None]:
result.select(F.explode(F.arrays_zip(
              result.matched_text.result,
              result.matched_text.begin,
              result.matched_text.end,
              result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|aspirin 100mg|   25| 37| Drug|
|  amoxicillin|  102|112| Drug|
| lansoprazole|  170|181| Drug|
|  paracetamol|   75| 85| Drug|
|      aspirin|   25| 31| Drug|
|    ibuprofen|  134|142| Drug|
+-------------+-----+---+-----+



As shown above, the `matcher_drug.csv` file contains two overlapping entries, `aspirin` and `Aspirin 100mg`, and the text matches both of them. To get both chunks in the output, the `MergeOverlapping` parameter must be set to `False`. The `TextMatcher` cell that follows repeats the configuration with `setMergeOverlapping(False)`.
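For intuition about what merging overlapping chunks means, here is a small pure-Python sketch that works on the `(begin, end)` offsets from the table above. It assumes that merging replaces overlapping chunks with a single chunk covering their union; the annotator may resolve overlaps differently, and `merge_overlapping` is a hypothetical helper.

```python
def merge_overlapping(chunks):
    """Merge chunks whose inclusive (begin, end) spans overlap, keeping the union span."""
    merged = []
    for begin, end in sorted(chunks):
        if merged and begin <= merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged

# Offsets of the six chunks found with MergeOverlapping set to False.
spans = [(25, 37), (102, 112), (170, 181), (75, 85), (25, 31), (134, 142)]
merged_spans = merge_overlapping(spans)
print(merged_spans)
```

Under this assumption, `aspirin` (25 to 31) is absorbed into `aspirin 100mg` (25 to 37), leaving five chunks.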

In [None]:
entityExtractor = medical.TextMatcher()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv") \
    .setOutputCol("matched_text")\
    .setDelimiter("#")\
    .setCaseSensitive(False)\
    .setMergeOverlapping(False)

matcher_pipeline = nlp.Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

result = matcher_pipeline.fit(data).transform(data)

In [None]:
result.select(F.explode(F.arrays_zip(
              result.matched_text.result,
              result.matched_text.begin,
              result.matched_text.end,
              result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|aspirin 100mg|   25| 37| Drug|
|  amoxicillin|  102|112| Drug|
| lansoprazole|  170|181| Drug|
|  paracetamol|   75| 85| Drug|
|      aspirin|   25| 31| Drug|
|    ibuprofen|  134|142| Drug|
+-------------+-----+---+-----+



When the `CaseSensitive` parameter is set to `True`, phrases must match the case used in the source file. Consequently, some chunks are no longer matched: in the output below, `aspirin 100mg` is missing because the file lists it as `Aspirin 100mg`.
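The effect of case sensitivity can be checked in isolation with the standard `re` module. This is only an analogy for the `setCaseSensitive` parameter, not the annotator's own matching logic.

```python
import re

dictionary_phrase = "Aspirin 100mg"  # as written in matcher_drug.csv
text = "John's doctor prescribed aspirin 100mg for his heart condition"

case_sensitive_hit = re.search(re.escape(dictionary_phrase), text)
case_insensitive_hit = re.search(re.escape(dictionary_phrase), text, re.IGNORECASE)

print(case_sensitive_hit)             # None: 'A' in the dictionary vs 'a' in the text
print(case_insensitive_hit.group(0))  # the phrase as it appears in the text
```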

In [None]:
entityExtractor = medical.TextMatcher()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv") \
    .setOutputCol("matched_text")\
    .setDelimiter("#")\
    .setCaseSensitive(True)\
    .setMergeOverlapping(False)

matcher_pipeline = nlp.Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)

In [None]:
result.select(F.explode(F.arrays_zip(
              result.matched_text.result,
              result.matched_text.begin,
              result.matched_text.end,
              result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+------------+-----+---+-----+
|       chunk|begin|end|label|
+------------+-----+---+-----+
| amoxicillin|  102|112| Drug|
|lansoprazole|  170|181| Drug|
| paracetamol|   75| 85| Drug|
|     aspirin|   25| 31| Drug|
|   ibuprofen|  134|142| Drug|
+------------+-----+---+-----+



## Multiple Entities

We can set multiple entities in the same file.

In [None]:
multiple_entities = """
Aspirin 100mg#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
fever#Symptom
headache#Symptom
tonsilitis#Disease
GORD#Disease
heart condition#Disease
"""

with open ('multiple_entities.csv', 'w') as f:
  f.write(multiple_entities)

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

entityExtractor = medical.TextMatcher() \
    .setInputCols(["document", "token"]) \
    .setEntities("multiple_entities.csv") \
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")

matcher_pipeline = nlp.Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")

matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)

In [None]:
result.select(F.explode(F.arrays_zip(
              result.matched_text.result,
              result.matched_text.begin,
              result.matched_text.end,
              result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|  aspirin 100mg|   25| 37|   Drug|
|    amoxicillin|  115|125|   Drug|
|   lansoprazole|  183|194|   Drug|
|    paracetamol|   75| 85|   Drug|
|      ibuprofen|  147|155|   Drug|
|          fever|   95| 99|Symptom|
|       headache|  105|112|Symptom|
|heart condition|   47| 61|Disease|
|     tonsilitis|  135|144|Disease|
|           GORD|  204|207|Disease|
+---------------+-----+---+-------+



## `TextMatcherModel`

This annotator is an instantiated model of the `TextMatcher`. Once you build a `TextMatcher()`, you can save it and use it with `TextMatcherModel()` via the `load()` function.

Let's rebuild one of the examples from before and save it.

In [None]:
entityExtractor = medical.TextMatcher() \
    .setInputCols(["document", "token"]) \
    .setEntities("multiple_entities.csv") \
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")

matcher_pipeline = nlp.Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")

# Keep the fitted pipeline so that its last stage can be saved in the next cell
matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)

Saving the fitted model to disk

In [None]:
matcher_model.stages[-1].write().overwrite().save("matcher_model")

Loading the saved model and using it with the `TextMatcherModel()` via `load`.

In [None]:
# Load from the same relative path that was used when saving
entity_ruler = nlp.TextMatcherModel.load('matcher_model') \
    .setInputCols(["document", "token"]) \
    .setOutputCol("matched_text")

pipeline = nlp.Pipeline(stages=[documentAssembler,
                            tokenizer,
                            entity_ruler])

empty_data = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = pipeline.fit(empty_data)

Checking the result

In [None]:
result = pipeline_model.transform(data)

result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                    result.matched_text.begin,
                                    result.matched_text.end,
                                    result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|heart condition|   47| 61|Disease|
|     tonsilitis|  135|144|Disease|
|           GORD|  204|207|Disease|
|  aspirin 100mg|   25| 37|   Drug|
|    amoxicillin|  115|125|   Drug|
|   lansoprazole|  183|194|   Drug|
|    paracetamol|   75| 85|   Drug|
|      ibuprofen|  147|155|   Drug|
|          fever|   95| 99|Symptom|
|       headache|  105|112|Symptom|
+---------------+-----+---+-------+



As seen above, we built a `TextMatcher`, saved the fitted model, and reused it through `TextMatcherModel`.
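The result table above lists the chunks label by label rather than in text order. As a minimal sketch in plain Python (no Spark required), the rows copied from that table can be sorted by their `begin` offset; the helper name `in_reading_order` is only illustrative. On the Spark DataFrame itself, adding `.orderBy("begin")` before `.show()` should give the same ordering.

```python
# Rows copied from the result table above: (chunk, begin, end, label).
rows = [
    ("heart condition", 47, 61, "Disease"),
    ("tonsilitis", 135, 144, "Disease"),
    ("GORD", 204, 207, "Disease"),
    ("aspirin 100mg", 25, 37, "Drug"),
    ("amoxicillin", 115, 125, "Drug"),
    ("lansoprazole", 183, 194, "Drug"),
    ("paracetamol", 75, 85, "Drug"),
    ("ibuprofen", 147, 155, "Drug"),
    ("fever", 95, 99, "Symptom"),
    ("headache", 105, 112, "Symptom"),
]


def in_reading_order(rows):
    """Return the rows sorted by their begin offset (illustrative helper)."""
    return sorted(rows, key=lambda row: row[1])


ordered = in_reading_order(rows)
for chunk, begin, end, label in ordered:
    print(f"{begin:>4} {end:>4} {label:<8} {chunk}")
```

Note that `end` is inclusive in these tables, so `end - begin + 1` equals the chunk length.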

## Using LightPipeline

The TextMatcher annotator can also be applied by using LightPipeline:

In [None]:
light_pipeline = nlp.LightPipeline(pipeline_model)

In [None]:
annotations = light_pipeline.fullAnnotate("John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD.")[0]
annotations.keys()

dict_keys(['document', 'token', 'matched_text'])

In [None]:
annotations.get('matched_text')

[Annotation(chunk, 47, 61, heart condition, {'chunk': '0', 'ner_source': 'matched_text', 'original_or_matched': 'heart condition', 'entity': 'Disease', 'sentence': '0'}, []),
 Annotation(chunk, 135, 144, tonsilitis, {'chunk': '1', 'ner_source': 'matched_text', 'original_or_matched': 'tonsilitis', 'entity': 'Disease', 'sentence': '0'}, []),
 Annotation(chunk, 204, 207, GORD, {'chunk': '2', 'ner_source': 'matched_text', 'original_or_matched': 'GORD', 'entity': 'Disease', 'sentence': '0'}, []),
 Annotation(chunk, 25, 37, aspirin 100mg, {'chunk': '3', 'ner_source': 'matched_text', 'original_or_matched': 'Aspirin 100mg', 'entity': 'Drug', 'sentence': '0'}, []),
 Annotation(chunk, 115, 125, amoxicillin, {'chunk': '4', 'ner_source': 'matched_text', 'original_or_matched': 'amoxicillin', 'entity': 'Drug', 'sentence': '0'}, []),
 Annotation(chunk, 183, 194, lansoprazole, {'chunk': '5', 'ner_source': 'matched_text', 'original_or_matched': 'lansoprazole', 'entity': 'Drug', 'sentence': '0'}, []),
 

Display the result with `spark-nlp-display`.

In [None]:
visualiser = nlp.viz.NerVisualizer()

visualiser.display(annotations, label_col='matched_text')

## Pretrained Models

<center>

  <b>Text Matcher Pretrained Models</b>

|index|model|entities|
|----:|:----|-------|
| 1| [drug_matcher](https://nlp.johnsnowlabs.com/2024/03/06/drug_matcher_en.html)  |`DRUG` |
| 2| [biomarker_matcher](https://nlp.johnsnowlabs.com/2024/03/06/biomarker_matcher_en.html)  |`Biomarker` |
| 3| [country_matcher](https://nlp.johnsnowlabs.com/2024/09/25/country_matcher_en.html) |`Country` |
| 4| [state_matcher](https://nlp.johnsnowlabs.com/2024/09/25/state_matcher_en.html) |`STATE` |
| 5| [country_matcher](https://nlp.johnsnowlabs.com/2024/10/23/country_matcher_en.html) |`COUNTRY` |


</center>

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

text_matcher = medical.TextMatcherModel.pretrained("drug_matcher","en","clinical/models") \
    .setInputCols(["document", "token"])\
    .setOutputCol("matched_text")

matcher_pipeline = nlp.Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  text_matcher])

text = """John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, ciprofloxacin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01."""

data = spark.createDataFrame([[text]]).toDF("text")

result = matcher_pipeline.fit(data).transform(data)

drug_matcher download started this may take some time.
Approximate size to download 5.5 MB
[OK!]


In [None]:
result.select(F.explode(F.arrays_zip(result.matched_text.result,
                                      result.matched_text.begin,
                                      result.matched_text.end,
                                      result.matched_text.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=70)

+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|      aspirin|   25| 31| DRUG|
|  paracetamol|   69| 79| DRUG|
|ciprofloxacin|  109|121| DRUG|
|    ibuprofen|  143|151| DRUG|
| lansoprazole|  179|190| DRUG|
+-------------+-----+---+-----+



# 📜 EntityRuler

This section covers the parameters and usage of **EntityRuler**. Spark NLP provides two annotators for this task: `EntityRulerApproach` and `EntityRulerModel`. <br/>

This annotator matches exact strings or regex patterns provided in a file against a document and assigns each match a named entity. The definitions can contain any number of named entities.

**📖 Learning Objectives:**

1. Understand how to match exact strings or regex patterns using a predefined dictionary.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp)

Python Documentation: [EntityRuler](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/er/entity_ruler/index.html#sparknlp.annotator.er.entity_ruler.EntityRulerApproach)

Scala Documentation: [EntityRuler](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/er/EntityRulerApproach)


The extraction resource can be provided in several formats: a "JSON", "JSONL" or "CSV" file. The path to the file is passed to `setPatternsResource`. The file format must be given as the "format" field of the `options` parameter map, and, depending on the file type, additional parameters might need to be set.
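The JSON and CSV forms both appear in the examples in this notebook, while JSONL is only mentioned. The following is a minimal sketch of writing the same kind of pattern records as JSONL, assuming the expected layout is one JSON object per line; the file name `entities.jsonl` and the helper `write_jsonl` are hypothetical, and passing `"JSONL"` as the `format` option is an assumption based on the format names listed above.

```python
import json

# Pattern records in the same shape as the JSON examples in this notebook.
patterns = [
    {"id": "drug-words", "label": "Drug",
     "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]},
    {"id": "disease-words", "label": "Disease",
     "patterns": ["heart condition", "tonsilitis", "GORD"]},
    {"id": "symptom-words", "label": "Symptom",
     "patterns": ["fever", "headache"]},
]


def write_jsonl(records, path):
    """Write one JSON object per line (illustrative helper)."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


write_jsonl(patterns, "entities.jsonl")

# Assumed usage, not verified here:
# .setPatternsResource("entities.jsonl", options={"format": "JSONL"})
```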

**🖨️ Input/Output Annotation Types**
- Input: `DOCUMENT`, `TOKEN`
- Output: `CHUNK`

**🔎 Parameters**


- `setPatternsResource` *(str)*: Sets the resource, in JSON or CSV format, that maps entities to patterns. Arguments:
    - `path` *(str)*: Path to the resource.
    - `read_as` *(str, optional)*: How to interpret the resource, by default `ReadAs.TEXT`.
    - `options` *(dict, optional)*: Options for parsing the resource, by default `{"format": "JSON"}`.

- `setSentenceMatch` *(Boolean)*: Whether to find matches at sentence level (`True`) or at token level (`False`).

- `setAlphabetResource` *(str)*: Alphabet resource (a plain text file with all language characters).

- `setUseStorage` *(Boolean)*: Sets whether to use RocksDB storage to serialize patterns.





## Keywords Patterns

EntityRuler produces its output chunks from the patterns we define, as shown in the example below. Each entry can also carry an `id` field, which is reported in the metadata of the matched chunk and identifies the pattern group that produced it.

In [None]:
import json

data = [

    {
        "id": "drug-words",
        "label": "Drug",
        "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
    },
    {
        "id": "disease-words",
        "label": "Disease",
        "patterns": ["heart condition","tonsilitis","GORD"]
    },
    {
        "id": "symptom-words",
        "label": "Symptom",
        "patterns": ["fever","headache"]
    },

]

with open("entities.json", "w") as f:
    json.dump(data, f)

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityRuler = medical.EntityRulerApproach()\
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("entities.json")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityRuler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")


result = pipeline.fit(data).transform(data)

Checking the results:

In [None]:
result.select(F.explode(F.arrays_zip(result.entities.result,
                                      result.entities.begin,
                                      result.entities.end,
                                      result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+
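To make the offsets in the table above easier to read, here is a rough plain-Python sketch that reproduces them with a case-insensitive substring search. It is only an illustration of keyword matching under that assumption, not how `EntityRulerApproach` is implemented; `match_keywords` is a hypothetical helper.

```python
import re

# Same keywords as in entities.json, keyed by label.
entity_patterns = {
    "Drug": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"],
    "Disease": ["heart condition", "tonsilitis", "GORD"],
    "Symptom": ["fever", "headache"],
}

text = ("John's doctor prescribed aspirin for his heart condition, along with "
        "paracetamol for his fever and headache, amoxicillin for his tonsilitis, "
        "ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.")


def match_keywords(text, entity_patterns, case_sensitive=False):
    """Return (chunk, begin, end, label) tuples sorted by begin offset."""
    flags = 0 if case_sensitive else re.IGNORECASE
    matches = []
    for label, keywords in entity_patterns.items():
        for keyword in keywords:
            for m in re.finditer(re.escape(keyword), text, flags):
                # m.end() is exclusive; the tables in this notebook use inclusive ends.
                matches.append((m.group(), m.start(), m.end() - 1, label))
    return sorted(matches, key=lambda match: match[1])


keyword_matches = match_keywords(text, entity_patterns)
for chunk, begin, end, label in keyword_matches:
    print(chunk, begin, end, label)
```

The pattern `lansoprazol` has no final "e", which is why the reported chunk covers only the first 11 characters of "lansoprazole" (177 to 187), and "amoxicillin" is not reported because it is not in the dictionary.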



For the CSV file we use the following configuration:


In [None]:
with open('./entities.csv', 'w') as csvfile:
    csvfile.write('SYMPTOM|fever\n')
    csvfile.write('SYMPTOM|headache\n')
    csvfile.write('DRUG|paracetamol\n')
    csvfile.write('DRUG|aspirin\n')
    csvfile.write('DRUG|lansoprazol\n')
    csvfile.write('DRUG|ibuprofen\n')
    csvfile.write('DISEASE|tonsilitis\n')
    csvfile.write('DISEASE|GORD\n')
    csvfile.write('DISEASE|heart condition')

In [None]:
! cat ./entities.csv

SYMPTOM|fever
SYMPTOM|headache
DRUG|paracetamol
DRUG|aspirin
DRUG|lansoprazol
DRUG|ibuprofen
DISEASE|tonsilitis
DISEASE|GORD
DISEASE|heart condition

In [None]:
entity_ruler_csv = nlp.EntityRulerApproach() \
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("./entities.csv", options={"format": "csv", "delimiter": "\\|"})

In [None]:
pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entity_ruler_csv
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

result = pipeline.fit(data).transform(data)

Checking the results:

In [None]:
result.select(F.explode(F.arrays_zip(result.entities.result,
                                      result.entities.begin,
                                      result.entities.end,
                                      result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   DRUG|
|heart condition|   41| 55|DISEASE|
|    paracetamol|   69| 79|   DRUG|
|          fever|   89| 93|SYMPTOM|
|       headache|   99|106|SYMPTOM|
|     tonsilitis|  129|138|DISEASE|
|      ibuprofen|  141|149|   DRUG|
|    lansoprazol|  177|187|   DRUG|
|           GORD|  198|201|DISEASE|
+---------------+-----+---+-------+
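The CSV and JSON resources carry the same information in different shapes: one `label|pattern` pair per line versus one record per label with a list of patterns. As a small sketch using only the standard library, the CSV content shown above can be regrouped into JSON-style records; `csv_to_pattern_records` is a hypothetical helper, and the resulting records have no `id` field because the CSV file does not contain one.

```python
import csv
import io
import json

# Same content as ./entities.csv above.
csv_text = """SYMPTOM|fever
SYMPTOM|headache
DRUG|paracetamol
DRUG|aspirin
DRUG|lansoprazol
DRUG|ibuprofen
DISEASE|tonsilitis
DISEASE|GORD
DISEASE|heart condition"""


def csv_to_pattern_records(csv_text, delimiter="|"):
    """Group label|pattern lines into one record per label."""
    grouped = {}
    for label, pattern in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        grouped.setdefault(label, []).append(pattern)
    return [{"label": label, "patterns": patterns}
            for label, patterns in grouped.items()]


records = csv_to_pattern_records(csv_text)
print(json.dumps(records, indent=2))
```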



## Regex Patterns

As shown in the example below, we can also define regex patterns to detect entities. An entry is treated as a regular expression when its `regex` field is set to `True`.

In [None]:
import json

data = [
    {
        "id": "date-regex",
        "label": "Date",
        "patterns": ["\\d{4}-\\d{2}-\\d{2}","\\d{4}"],
        "regex": True
    },
    {
        "id": "drug-words",
        "label": "Drug",
        "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
    },
    {
        "id": "disease-words",
        "label": "Disease",
        "patterns": ["heart condition","tonsilitis","GORD"]
    },
    {
        "id": "symptom-words",
        "label": "Symptom",
        "patterns": ["fever","headache"]
    },
]

with open("entities.json", "w") as f:
    json.dump(data, f)

In [None]:
entityRuler = nlp.EntityRulerApproach()\
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("entities.json")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityRuler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

Checking the results:

In [None]:
result.select(F.explode(F.arrays_zip(result.entities.result,
                                      result.entities.begin,
                                      result.entities.end,
                                      result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|     2023-12-01|  206|215|   Date|
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+
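The `Date` row above can be checked with Python's own `re` module. This is only a sketch of what the first pattern matches and where; it does not reproduce how EntityRuler applies patterns to tokens or how it resolves the overlapping `\d{4}` pattern. Note that the doubled backslashes in the pattern list are Python string escapes, so the regex actually stored is `\d{4}-\d{2}-\d{2}`.

```python
import re

text = ("John's doctor prescribed aspirin for his heart condition, along with "
        "paracetamol for his fever and headache, amoxicillin for his tonsilitis, "
        "ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.")

# Same pattern string as in the "date-regex" entry above.
pattern_source = "\\d{4}-\\d{2}-\\d{2}"
date_pattern = re.compile(pattern_source)

# m.end() is exclusive, so subtract 1 to get the inclusive end used in the tables.
date_matches = [(m.group(), m.start(), m.end() - 1)
                for m in date_pattern.finditer(text)]
print(date_matches)
```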



## `EntityRulerModel`

`EntityRulerModel` is the fitted (instantiated) form of `EntityRulerApproach`. Once you build an `EntityRulerApproach()`, you can save the fitted model and load it later with `EntityRulerModel.load()`. <br/>

Let's rebuild one of the earlier examples and save it.

In [None]:
data = spark.createDataFrame([["John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01."]]).toDF("text")
data.show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.|
+-----------------------------------------------------------------------------------------------------------------------

Saving the approach to disk

In [None]:
empty_data = spark.createDataFrame([[""]]).toDF("text")

pipeline_model = pipeline.fit(empty_data)

pipeline_model.stages[-1].write().overwrite().save("ruler_approach_model")

Loading the saved model with `EntityRulerModel.load()`:

In [None]:
entity_ruler = nlp.EntityRulerModel.load('/content/ruler_approach_model') \
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")

pipeline = nlp.Pipeline(stages=[documentAssembler,
                            tokenizer,
                            entity_ruler])

result = pipeline.fit(data).transform(data)

In [None]:
result.select(F.explode(F.arrays_zip(result.entities.result,
                                      result.entities.begin,
                                      result.entities.end,
                                      result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|     2023-12-01|  206|215|   Date|
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+



## Using LightPipeline

The EntityRuler annotator can also be applied by using LightPipeline:

In [None]:
light_pipeline = nlp.LightPipeline(pipeline_model)

In [None]:
annotations = light_pipeline.fullAnnotate("John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.")[0]
annotations.keys()

dict_keys(['document', 'token', 'entities'])

In [None]:
annotations.get('entities')

[Annotation(chunk, 206, 215, 2023-12-01, {'entity': 'Date', 'id': 'date-regex', 'sentence': '0'}, []),
 Annotation(chunk, 25, 31, aspirin, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 41, 55, heart condition, {'entity': 'Disease', 'sentence': '0', 'id': 'disease-words'}, []),
 Annotation(chunk, 69, 79, paracetamol, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 89, 93, fever, {'entity': 'Symptom', 'sentence': '0', 'id': 'symptom-words'}, []),
 Annotation(chunk, 99, 106, headache, {'entity': 'Symptom', 'sentence': '0', 'id': 'symptom-words'}, []),
 Annotation(chunk, 129, 138, tonsilitis, {'entity': 'Disease', 'sentence': '0', 'id': 'disease-words'}, []),
 Annotation(chunk, 141, 149, ibuprofen, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 177, 187, lansoprazol, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 198, 201, GORD, {'entity': 'Disease', 'sent

Display the result with `spark-nlp-display`.

In [None]:
visualiser = nlp.viz.NerVisualizer()

visualiser.display(annotations, label_col='entities')
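
The annotations returned by `fullAnnotate` can also be flattened into plain tuples for further processing. The sketch below uses a hypothetical stand-in class, `FakeAnnotation`, so that it runs without Spark; it assumes the real annotation objects expose `begin`, `end`, `result` and `metadata` attributes, matching the fields printed above. With real output, `to_rows(annotations.get('entities'))` would be the intended call.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in exposing the fields printed in the output above."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)


def to_rows(annotations):
    """Flatten annotation-like objects into (chunk, begin, end, label) tuples."""
    return [(a.result, a.begin, a.end, a.metadata.get("entity"))
            for a in annotations]


# Two entries copied from the printed annotations above.
sample = [
    FakeAnnotation("chunk", 206, 215, "2023-12-01",
                   {"entity": "Date", "id": "date-regex", "sentence": "0"}),
    FakeAnnotation("chunk", 25, 31, "aspirin",
                   {"entity": "Drug", "sentence": "0", "id": "drug-words"}),
]

flat_rows = to_rows(sample)
for row in flat_rows:
    print(row)
```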