![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# EntityRulerInternal

This notebook covers the parameters and usage of **EntityRulerInternal**. Two annotators perform this task in Spark NLP: `EntityRulerInternalApproach` and `EntityRulerInternalModel`.

This annotator matches exact strings or regex patterns, provided in a file, against a document and assigns each match a named-entity label. The definitions can contain any number of named entities.

**📖 Learning Objectives:**

1. Understand how to match exact strings or regex patterns by using a predefined dictionary.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/42.TextMatcher.ipynb)

Reference Documentation: [EntityRulerInternal](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#entityrulerinternal)


## **📜 Background**


The extraction resource can be supplied in several formats: a "JSON", "JSONL", or "CSV" file. The path to the file is passed to ``setPatternsResource``. The file format is set through the "format" field of the ``options`` parameter map and, depending on the file type, additional parameters may be required.
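
As a minimal sketch of the JSONL option, the snippet below writes one pattern definition per line using only the standard library. The field names (`id`, `label`, `patterns`) mirror the JSON examples later in this notebook. The file name `entities.jsonl` is illustrative, and the assumption that the annotator reads this layout when `options={"format": "JSONL"}` is passed to ``setPatternsResource`` should be checked against the reference documentation.

```python
import json

# Pattern definitions, using the same fields as a JSON resource.
patterns = [
    {"id": "drug-words", "label": "Drug", "patterns": ["paracetamol", "aspirin"]},
    {"id": "symptom-words", "label": "Symptom", "patterns": ["fever", "headache"]},
]

# JSONL: one JSON object per line (the file name is illustrative).
with open("entities.jsonl", "w") as f:
    for entry in patterns:
        f.write(json.dumps(entry) + "\n")

# Read the file back to confirm that each line parses on its own.
with open("entities.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), "definitions written")
```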

## **🎬 Colab Setup**

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
import json
import os

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession

import sparknlp
import sparknlp_jsl

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(secret = license_keys["SECRET"], params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.3.0
Spark NLP_JSL Version : 5.3.0


## **🖨️ Input/Output Annotation Types**
- Input: ``DOCUMENT``, ``TOKEN``
- Output: ``CHUNK``

## **🔎 Parameters**


- `setPatternsResource` *(str)*: Sets the resource, in JSON, JSONL, or CSV format, that maps entities to patterns.
    - `path` *(str)*: Path to the resource.
    - `read_as` *(str, optional)*: How to interpret the resource; defaults to `ReadAs.TEXT`.
    - `options` *(dict, optional)*: Options for parsing the resource; defaults to `{"format": "JSON"}`.

- `setSentenceMatch` *(Boolean)*: Whether to match at the sentence level (`True`) or at the token level (`False`).

- `setAlphabetResource` *(str)*: Alphabet resource (a plain-text file containing all characters of the language).

- `setUseStorage` *(Boolean)*: Whether to use RocksDB storage to serialize patterns.





## `EntityRulerInternalApproach`

## Keywords Patterns

EntityRulerInternal outputs chunks based on the defined patterns, as shown in the example below. Each group of patterns can also carry an `id` field, which identifies the entity and is returned in the chunk metadata.

In [None]:
import json

data = [

    {
        "id": "drug-words",
        "label": "Drug",
        "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
    },
    {
        "id": "disease-words",
        "label": "Disease",
        "patterns": ["heart condition","tonsilitis","GORD"]
    },
    {
        "id": "symptom-words",
        "label": "Symptom",
        "patterns": ["fever","headache"]
    },

]

with open("entities.json", "w") as f:
    json.dump(data, f)

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityRuler = EntityRulerInternalApproach()\
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("entities.json")\
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityRuler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

Checking the results:

In [None]:
result.select(F.explode(F.arrays_zip(
              result.entities.result,
              result.entities.begin,
              result.entities.end,
              result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+
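
The `begin` and `end` columns are character offsets into the input text, and the values above suggest that `end` is inclusive. A quick check in plain Python, using a few rows copied from the table (no Spark session required):

```python
text = ("John's doctor prescribed aspirin for his heart condition, along with "
        "paracetamol for his fever and headache, amoxicillin for his tonsilitis, "
        "ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.")

# (chunk, begin, end) rows copied from the results table.
rows = [
    ("aspirin", 25, 31),
    ("heart condition", 41, 55),
    ("lansoprazol", 177, 187),
    ("GORD", 198, 201),
]

# `end` is inclusive, so the Python slice needs end + 1.
recovered = [text[begin:end + 1] for _, begin, end in rows]
print(recovered)
```

Note that `lansoprazol` is matched inside the longer word `lansoprazole`; if the full drug name is wanted as the chunk, the dictionary entry would need to be spelled `lansoprazole`.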



Patterns can also be supplied as a CSV file. Each line holds a label and a pattern separated by a delimiter, here `|`:


In [None]:
with open('./entities.csv', 'w') as csvfile:
    csvfile.write('SYMPTOM|fever\n')
    csvfile.write('SYMPTOM|headache\n')
    csvfile.write('DRUG|paracetamol\n')
    csvfile.write('DRUG|aspirin\n')
    csvfile.write('DRUG|lansoprazol\n')
    csvfile.write('DRUG|ibuprofen\n')
    csvfile.write('DISEASE|tonsilitis\n')
    csvfile.write('DISEASE|GORD\n')
    csvfile.write('DISEASE|heart condition')

In [None]:
! cat ./entities.csv

SYMPTOM|fever
SYMPTOM|headache
DRUG|paracetamol
DRUG|aspirin
DRUG|lansoprazol
DRUG|ibuprofen
DISEASE|tonsilitis
DISEASE|GORD
DISEASE|heart condition

In [None]:
entity_ruler_csv = EntityRulerInternalApproach() \
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("./entities.csv", options={"format": "csv", "delimiter": "\\|"})
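
The delimiter is written as `"\\|"` rather than `"|"`. A plausible reading, which this notebook does not state explicitly, is that the delimiter is interpreted as a regular expression, where a bare `|` means alternation. The sketch below illustrates the difference with Python's `re` module; the annotator itself runs on the JVM, so this is an analogy rather than its actual parsing code.

```python
import re

line = "DISEASE|heart condition"

# Escaped pipe: matches a literal "|" and splits the label from the pattern.
label, pattern = re.split("\\|", line)

# Unescaped pipe: an empty alternation that matches the empty string,
# so the line is not split into a clean (label, pattern) pair.
unescaped = re.split("|", line)

print(label, "->", pattern)
```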

In [None]:
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entity_ruler_csv
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

Checking the results:

In [None]:
result.select(F.explode(F.arrays_zip(
              result.entities.result,
              result.entities.begin,
              result.entities.end,
              result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   DRUG|
|heart condition|   41| 55|DISEASE|
|    paracetamol|   69| 79|   DRUG|
|          fever|   89| 93|SYMPTOM|
|       headache|   99|106|SYMPTOM|
|     tonsilitis|  129|138|DISEASE|
|      ibuprofen|  141|149|   DRUG|
|    lansoprazol|  177|187|   DRUG|
|           GORD|  198|201|DISEASE|
+---------------+-----+---+-------+



## Regex Patterns

As shown in the example below, we can also define regex patterns to detect entities by setting `"regex": True` on a pattern group.

In [None]:
import json

data = [
    {
        "id": "date-regex",
        "label": "Date",
        "patterns": ["\\d{4}-\\d{2}-\\d{2}","\\d{4}"],
        "regex": True
    },
    {
        "id": "drug-words",
        "label": "Drug",
        "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
    },
    {
        "id": "disease-words",
        "label": "Disease",
        "patterns": ["heart condition","tonsilitis","GORD"]
    },
    {
        "id": "symptom-words",
        "label": "Symptom",
        "patterns": ["fever","headache"]
    },

]

with open("entities.json", "w") as f:
    json.dump(data, f)

In [None]:
entityRuler = EntityRulerInternalApproach()\
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("entities.json")\
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityRuler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

Checking the results:

In [None]:
result.select(F.explode(F.arrays_zip(
              result.entities.result,
              result.entities.begin,
              result.entities.end,
              result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|     2023-12-01|  206|215|   Date|
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+
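
The `Date` row can be reproduced in plain Python. This sketch uses Python's `re` module as a stand-in for the regex engine inside the annotator, so small differences in behavior are possible; it only shows that the first `Date` pattern matches the full date at the offsets reported above.

```python
import re

text = ("John's doctor prescribed aspirin for his heart condition, along with "
        "paracetamol for his fever and headache, amoxicillin for his tonsilitis, "
        "ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.")

# First Date pattern from entities.json; "\\d" in Python source is the regex \d.
date_pattern = "\\d{4}-\\d{2}-\\d{2}"

match = re.search(date_pattern, text)

# Convert Python's exclusive end to the inclusive end used in the results table.
chunk, begin, end = match.group(), match.start(), match.end() - 1
print(chunk, begin, end)
```

Under `re`, the second pattern `\d{4}` would also match `2023` on its own. The table shows a single `Date` chunk, so the annotator evidently does not report that shorter match here; this is an observation from the output, not documented behavior.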



## `EntityRulerInternalModel`

This annotator is the fitted model produced by `EntityRulerInternalApproach`. Once you have fitted an `EntityRulerInternalApproach()`, you can save the result and load it again with the `EntityRulerInternalModel.load()` function.

Let's rebuild one of the examples above and save it.

In [None]:
data = spark.createDataFrame([["John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01."]]).toDF("text")
data.show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.|
+-----------------------------------------------------------------------------------------------------------------------

Save the fitted entity ruler (the last stage of the fitted pipeline) to disk:

In [None]:
model.stages[-1].write().overwrite().save("ruler_approach_model")

Load the saved model with `EntityRulerInternalModel.load()` and use it in a new pipeline:

In [None]:
entity_ruler = EntityRulerInternalModel.load("ruler_approach_model") \
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")

pipeline = Pipeline(stages=[documentAssembler,
                            tokenizer,
                            entity_ruler])

pipeline_model = pipeline.fit(data)
result = pipeline_model.transform(data)

In [None]:
result.select(F.explode(F.arrays_zip(
              result.entities.result,
              result.entities.begin,
              result.entities.end,
              result.entities.metadata,)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(truncate=30)

+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|     2023-12-01|  206|215|   Date|
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+



## Using LightPipeline

The EntityRulerInternal annotator can also be applied by using LightPipeline:

In [None]:
light_pipeline = LightPipeline(pipeline_model)

In [None]:
annotations = light_pipeline.fullAnnotate("John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.")[0]
annotations.keys()

dict_keys(['document', 'token', 'entities'])

In [None]:
annotations.get('entities')

[Annotation(chunk, 206, 215, 2023-12-01, {'entity': 'Date', 'id': 'date-regex', 'sentence': '0'}, []),
 Annotation(chunk, 25, 31, aspirin, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 41, 55, heart condition, {'entity': 'Disease', 'sentence': '0', 'id': 'disease-words'}, []),
 Annotation(chunk, 69, 79, paracetamol, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 89, 93, fever, {'entity': 'Symptom', 'sentence': '0', 'id': 'symptom-words'}, []),
 Annotation(chunk, 99, 106, headache, {'entity': 'Symptom', 'sentence': '0', 'id': 'symptom-words'}, []),
 Annotation(chunk, 129, 138, tonsilitis, {'entity': 'Disease', 'sentence': '0', 'id': 'disease-words'}, []),
 Annotation(chunk, 141, 149, ibuprofen, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 177, 187, lansoprazol, {'entity': 'Drug', 'sentence': '0', 'id': 'drug-words'}, []),
 Annotation(chunk, 198, 201, GORD, {'entity': 'Disease', 'sent
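
Each `Annotation` exposes `result`, `begin`, `end`, and `metadata`, so the list can be flattened into plain rows for inspection. The sketch below uses `SimpleNamespace` objects as stand-ins for two of the annotations shown above so that it runs without a Spark session; with the real output, `annotations['entities']` would take the place of `entities`.

```python
from types import SimpleNamespace

# Stand-ins for sparknlp Annotation objects (values copied from the output above).
entities = [
    SimpleNamespace(result="2023-12-01", begin=206, end=215,
                    metadata={"entity": "Date", "id": "date-regex", "sentence": "0"}),
    SimpleNamespace(result="aspirin", begin=25, end=31,
                    metadata={"entity": "Drug", "sentence": "0", "id": "drug-words"}),
]

# Flatten each annotation into (chunk, begin, end, label, id).
rows = [(a.result, a.begin, a.end, a.metadata["entity"], a.metadata.get("id"))
        for a in entities]

# Sort by position in the text, since the regex match is listed first.
rows.sort(key=lambda r: r[1])
for row in rows:
    print(row)
```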

Display the result with `spark-nlp-display`.

In [None]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(annotations, label_col='entities')