![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/21.01.EntityRuler.ipynb)

# **EntityRuler**

This notebook covers the parameters and usage of **EntityRuler**. Spark NLP has 2 annotators for this task: `EntityRulerApproach` and `EntityRulerModel`. <br/>

- `EntityRulerApproach` fits an annotator that matches exact strings or regex patterns provided in a file against a document and assigns each match a named entity. The definitions can contain any number of named entities. 
- `EntityRulerModel` is the instantiated (fitted) model of the `EntityRulerApproach`.

**📖 Learning Objectives:**

1. Understand how to extract entities with predefined regex patterns or match predefined exact strings. 

2. Understand the difference between the `EntityRulerApproach` and `EntityRulerModel`.

3. Become comfortable using the different parameters of these annotators.

**🔗 Helpful Links:**

Documentation: [EntityRuler](https://nlp.johnsnowlabs.com/docs/en/annotators#entityruler)

Python Docs: [EntityRulerApproach](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/er/entity_ruler/index.html#sparknlp.annotator.er.entity_ruler.EntityRulerApproach), [EntityRulerModel](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/er/entity_ruler/index.html#sparknlp.annotator.er.entity_ruler.EntityRulerModel)

Scala Docs: [EntityRulerApproach](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/er/EntityRulerApproach.html), [EntityRulerModel](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/er/EntityRulerModel.html)

## **📜 Background**

Extracting entities is an important task in NLP. In Spark NLP, `EntityRulerApproach` and `EntityRulerModel` perform this task from a predefined custom file of rules, without building a machine learning or deep learning model. <br/>

There are multiple ways and formats to set the extraction resource. It is possible to set it either as a “JSON”, “JSONL” or “CSV” file. 

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.1.2  spark-nlp==4.2.5

In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

spark = sparknlp.start()
spark

# `EntityRulerApproach`

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`
- Output: `CHUNK`

## **🔎 Parameters**


- `caseSensitive`: (Boolean) Whether matching is case sensitive (default depends on the model).
- `patternsResource`: (String) Resource in JSON, JSONL or CSV format that maps entities to patterns.
- `useStorage`: (Boolean) Whether to use RocksDB storage to serialize patterns (Default: false).
- `sentenceMatch`: (Boolean) Whether to look for matches at the sentence level.
- `alphabetResource`: (String) Alphabet resource (a plain text file with all the characters of the language).

### `setPatternsResource()`

There are multiple ways and formats to set the extraction resource: it can be a “JSON”, “JSONL” or “CSV” file. The path to the file is passed to `setPatternsResource()`. <br/>

The file format is set in the `format` field of the `options` parameter map, and depending on the file type, additional options might need to be set.

In our first example, we will define keywords and their labels to be matched by `EntityRulerApproach`. <br/>

If the file is in a JSON format, then the rule definitions need to be given in a list with the fields “label” and “patterns”:

In [3]:
#Defining the source JSON file and saving
import json

keywords = [
          {
            "label": "PERSON",
            "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
          },
          {
            "label": "PERSON",
            "patterns": ["Eddard", "Eddard Stark"]
          },
          {
            "label": "LOCATION",
            "patterns": ["Winterfell"]
          },
         ]

with open('keywords.json', 'w') as jsonfile:
    json.dump(keywords, jsonfile)
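
Before handing a patterns file to `setPatternsResource()`, it can help to check that every rule has the expected fields. The helper below is a minimal sketch and not part of Spark NLP: `validate_rules` is a hypothetical name, and it checks only the structure described above (a list of objects with a "label" string and a "patterns" list of strings).

```python
import json


def validate_rules(rules):
    """Return a list of problems found in a list of EntityRuler-style rules.

    Hypothetical helper for illustration; it only checks the "label" and
    "patterns" fields described in this notebook.
    """
    if not isinstance(rules, list):
        return ["top-level JSON value must be a list of rules"]
    problems = []
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule {i}: must be an object")
            continue
        label = rule.get("label")
        if not isinstance(label, str) or not label:
            problems.append(f"rule {i}: missing or empty 'label'")
        patterns = rule.get("patterns")
        if not isinstance(patterns, list) or not patterns:
            problems.append(f"rule {i}: 'patterns' must be a non-empty list")
        elif not all(isinstance(p, str) for p in patterns):
            problems.append(f"rule {i}: every pattern must be a string")
    return problems


# The second rule is deliberately wrong: "patterns" is a string, not a list.
sample_rules = json.loads("""
[
  {"label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]},
  {"label": "LOCATION", "patterns": "Winterfell"}
]
""")

for problem in validate_rules(sample_rules):
    print(problem)
```

An empty result means the structure looks fine; it says nothing about whether the patterns will actually match.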

Sample dataframe

In [4]:
data = spark.createDataFrame([["Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell."]]).toDF("text")
data.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell.|
+-----------------------------------------------------------------------------+



Building a pipeline with `EntityRulerApproach()`

In [5]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols('document')\
    .setOutputCol('sentence')

entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("keywords.json") 

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

Checking matched entities

In [6]:
result.select("entity").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 5, 16, Eddard Stark, {entity -> PERSON, sentence -> 0}, []}, {chunk, 47, 55, John Snow, {entity -> PERSON, sentence -> 1}, []}, {chunk, 66, 75, Winterfell, {entity -> LOCATION, sentence -> 1}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [7]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+------------+--------+
|     keyword|   label|
+------------+--------+
|Eddard Stark|  PERSON|
|   John Snow|  PERSON|
|  Winterfell|LOCATION|
+------------+--------+



As seen above, keywords that we defined in the JSON file were matched. 

We can also define an "id" field to identify entities, as in the example below.

In [8]:
keywords = [
            {
              "id": "names-with-j",
              "label": "PERSON",
              "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
            },
            {
              "id": "names-with-e",
              "label": "PERSON",
              "patterns": ["Eddard", "Eddard Stark"]
            },
            {
              "id": "locations",
              "label": "LOCATION",
              "patterns": ["Winterfell"]
            },
         ]

with open('keywords_with_id.json', 'w') as jsonfile:
    json.dump(keywords, jsonfile)

Defining the `EntityRulerApproach()` again with the new JSON.

In [9]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("keywords_with_id.json") 

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

Checking the results

In [10]:
result.select("entity").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                                                                                                                 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 5, 16, Eddard Stark, {entity -> PERSON, sentence -> 0, id -> names-with-e}, []}, {chunk, 47, 55, John Snow, {entity -> PERSON, sentence -> 1, id -> names-with-j}, []}, {chunk, 66, 75, Winterfe

In [11]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['1']['id']").alias("id"),
              F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+------------+------------+--------+
|          id|     keyword|   label|
+------------+------------+--------+
|names-with-e|Eddard Stark|  PERSON|
|names-with-j|   John Snow|  PERSON|
|   locations|  Winterfell|LOCATION|
+------------+------------+--------+



As seen above, the "id" field we defined is returned in the metadata of each match. 

Now, we will do an example with a source file in CSV format. For the CSV file we use the following configuration:

In [12]:
with open('keywords.csv', 'w') as csvfile:
    csvfile.write('PERSON|Jon\n')
    csvfile.write('PERSON|John\n')
    csvfile.write('PERSON|John Snow\n')
    csvfile.write('LOCATION|Winterfell')

In [13]:
! cat keywords.csv

PERSON|Jon
PERSON|John
PERSON|John Snow
LOCATION|Winterfell
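
The same `label|pattern` layout can also be produced from a Python list with the standard `csv` module. This is only a sketch; the file name `keywords_from_list.csv` and the variable names are illustrative and do not appear elsewhere in this notebook.

```python
import csv

# (label, pattern) pairs; each pair becomes one CSV row.
rows = [
    ("PERSON", "Jon"),
    ("PERSON", "John"),
    ("PERSON", "John Snow"),
    ("LOCATION", "Winterfell"),
]

# newline="" lets the csv module control line endings itself.
with open("keywords_from_list.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|", lineterminator="\n")
    writer.writerows(rows)

# Read the file back to inspect the lines that were written.
with open("keywords_from_list.csv", encoding="utf-8") as f:
    csv_lines = f.read().splitlines()

print(csv_lines)
```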

Building `EntityRulerApproach()` with the CSV source file: <br/>
We will also set the **format** of the file as a CSV and will specify the **delimiter** for the CSV file.

In [14]:
entity_ruler_csv = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("keywords.csv", options={"format": "csv", "delimiter": "\\|"})

# Use the CSV-based ruler defined above, not the earlier JSON-based `entity_ruler`.
pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            entity_ruler_csv])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

Checking the results. 

In [15]:
result.select("entity").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 47, 55, John Snow, {entity -> PERSON, sentence -> 1}, []}, {chunk, 66, 75, Winterfell, {entity -> LOCATION, sentence -> 1}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------------+

In [16]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+----------+--------+
|   keyword|   label|
+----------+--------+
| John Snow|  PERSON|
|Winterfell|LOCATION|
+----------+--------+



As you see above, we successfully defined a CSV file as a source and used it with the `EntityRulerApproach()`
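
JSONL is the third format mentioned earlier. The sketch below writes the rules as JSON Lines (one JSON object per line) using only the standard library; the file name `keywords.jsonl` is illustrative. The commented `setPatternsResource` call at the end is an assumption that mirrors the CSV call above with `"format": "jsonl"`, and it is not executed here.

```python
import json

rules = [
    {"id": "names-with-j", "label": "PERSON", "patterns": ["Jon", "John", "John Snow"]},
    {"id": "locations", "label": "LOCATION", "patterns": ["Winterfell"]},
]

# JSON Lines: one complete JSON object per line, with no enclosing list.
with open("keywords.jsonl", "w", encoding="utf-8") as f:
    for rule in rules:
        f.write(json.dumps(rule) + "\n")

# Read the file back to confirm that every line parses on its own.
with open("keywords.jsonl", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]

print(parsed == rules)

# Assumed usage, mirroring the CSV example (not executed in this sketch):
# EntityRulerApproach().setPatternsResource("keywords.jsonl", options={"format": "jsonl"})
```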

### Using Regular Expressions 



If you need to set a regex pattern for the matching, you can specify a key value pair as `"regex": true` in the source JSON file. 

In [17]:
#Sample data
data = spark.createDataFrame([["The address is 123456 in Winterfell"]]).toDF("text")
data.show(truncate=False)

+-----------------------------------+
|text                               |
+-----------------------------------+
|The address is 123456 in Winterfell|
+-----------------------------------+



We will define a JSON source file that has a regex rule for matching the numeric ID and a keyword rule for the location. <br/>

We will specify the "id", "label", "patterns" and "regex" keys in the JSON. The "regex" key should be set to `true` for regex patterns. 

In [18]:
import json 

patterns_string = """
[
  {
    "id": "id-regex",
    "label": "ID",
    "patterns": ["[0-9]+"],
    "regex": true
      },
  {
    "id": "locations-words",
    "label": "LOCATION",
    "patterns": ["Winterfell"],
    "regex": false

  }
]
"""

patterns_obj = json.loads(patterns_string)
with open('regex_patterns.json', 'w') as jsonfile:
    json.dump(patterns_obj, jsonfile)

In [19]:
!cat regex_patterns.json

[{"id": "id-regex", "label": "ID", "patterns": ["[0-9]+"], "regex": true}, {"id": "locations-words", "label": "LOCATION", "patterns": ["Winterfell"], "regex": false}]

**Note:** When defining a regex pattern, we need to include a `Tokenizer` annotator in the pipeline and pass its output column to the entity ruler!

In [20]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols('document')\
    .setOutputCol('sentence')

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

regex_entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setPatternsResource("regex_patterns.json") 

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            tokenizer,
                            regex_entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

Checking the results

In [21]:
result.select("entity").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 15, 20, 123456, {entity -> ID, id -> id-regex, sentence -> 0}, []}, {chunk, 25, 34, Winterfell, {entity -> LOCATION, sentence -> 0, id -> locations-words}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



In [22]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+----------+--------+
|   keyword|   label|
+----------+--------+
|    123456|      ID|
|Winterfell|LOCATION|
+----------+--------+



As seen above, we matched '123456' as an id as we defined with a regex in the source JSON file. 
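
To see what the `[0-9]+` rule does, the pattern can be tried against the whitespace-separated tokens of the sample sentence with Python's `re` module. This only illustrates token-level regex matching. It is not how Spark NLP implements the annotator, and `str.split()` is a simplifying stand-in for the `Tokenizer`.

```python
import re

text = "The address is 123456 in Winterfell"
id_pattern = re.compile(r"[0-9]+")

matches = []
position = 0
for token in text.split():
    # Locate the token in the original text to recover character offsets.
    begin = text.index(token, position)
    end = begin + len(token) - 1  # inclusive end, as in the annotation output
    position = end + 1
    # A token is reported only if the whole token fits the pattern.
    if id_pattern.fullmatch(token):
        matches.append((begin, end, token, "ID"))

print(matches)
```

The offsets `15` and `20` agree with the `begin` and `end` values of the `123456` chunk shown above.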

### `setSentenceMatch()`
This parameter sets whether matches are searched at the sentence level. With `setSentenceMatch(False)`, each token of a sentence is matched on its own. With `setSentenceMatch(True)`, each sentence of a document is matched as a whole, which is particularly useful for multi-word matches. <br/>

**Please note that this parameter only works for Regex Patterns!**

In [23]:
#sample data
data = spark.createDataFrame([["Patrick lives in New York City"]]).toDF("text")
data.show(truncate=False)

+------------------------------+
|text                          |
+------------------------------+
|Patrick lives in New York City|
+------------------------------+



In this example, we will extract "New York City" by defining a regex. 

In [24]:
import json

patterns_string = """
[
  {
    "id": "locations-words",
    "label": "LOCATION",
    "patterns": ["\\\\bNew York City\\\\b"],
    "regex": true
      },
  {
    "id": "name-words",
    "label": "NAME",
    "patterns": ["Patrick"],
    "regex": false

  }
]
"""

patterns_obj = json.loads(patterns_string)
with open('patterns.json', 'w') as jsonfile:
    json.dump(patterns_obj, jsonfile)

**`setSentenceMatch(False)`**: We do not expect a match for "New York City", since it is a multi-token entity and we are not matching at the sentence level. 

In [25]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols('document')\
    .setOutputCol('sentence')

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

regex_entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setPatternsResource("patterns.json") \
    .setSentenceMatch(False)

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            tokenizer,
                            regex_entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

In [26]:
result.select("entity").show(truncate=False)

+-------------------------------------------------------------------------------+
|entity                                                                         |
+-------------------------------------------------------------------------------+
|[{chunk, 0, 6, Patrick, {entity -> NAME, sentence -> 0, id -> name-words}, []}]|
+-------------------------------------------------------------------------------+



In [27]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+-------+-----+
|keyword|label|
+-------+-----+
|Patrick| NAME|
+-------+-----+



As you see, only "Patrick" was matched. 

**`setSentenceMatch(True)`**: Now, we expect a match for the "New York City" since we will check for the sentence level match. 

In [28]:
regex_entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setPatternsResource("patterns.json") \
    .setSentenceMatch(True)

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            tokenizer,
                            regex_entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

In [29]:
result.select("entity").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 17, 29, New York City, {entity -> LOCATION, id -> locations-words, sentence -> 0}, []}, {chunk, 0, 6, Patrick, {entity -> NAME, sentence -> 0, id -> name-words}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



In [30]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+-------------+--------+
|      keyword|   label|
+-------------+--------+
|New York City|LOCATION|
|      Patrick|    NAME|
+-------------+--------+



As seen above, "New York City" was matched.
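
The difference between the two settings can be imitated with Python's `re` module. The sketch below is an analogy, not the annotator's implementation: `str.split()` stands in for the `Tokenizer`, and the whole string stands in for the single sentence.

```python
import re

text = "Patrick lives in New York City"
location_pattern = re.compile(r"\bNew York City\b")

# Token level: the pattern is tried on one token at a time, so a
# three-word pattern never fits inside a single token.
token_hits = [tok for tok in text.split() if location_pattern.search(tok)]

# Sentence level: the pattern is searched over the whole sentence.
sentence_hits = [
    (m.start(), m.end() - 1, m.group())  # inclusive end offset
    for m in location_pattern.finditer(text)
]

print(token_hits)
print(sentence_hits)
```

Token-level matching finds nothing, while sentence-level matching finds the span from offset 17 to 29, the same offsets as the `New York City` chunk above.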

### `setAlphabetResource`

Since Spark NLP 4.2.0, Entity Ruler uses the Aho-Corasick algorithm, which significantly reduces its latency. This requires defining an alphabet in some cases. For English documents you don't need to define one, because the Entity Ruler annotator uses an English alphabet by default. For special use cases, however, we need to proceed as in the example below:

In [31]:
data = spark.createDataFrame([["Elendil used to live in Númenor"]]).toDF("text")
data.show(truncate=False)

+-------------------------------+
|text                           |
+-------------------------------+
|Elendil used to live in Númenor|
+-------------------------------+



The text above contains a special character: the vowel u with an acute accent (ú).

In [32]:
import json

locations = [
              {
                "id": "locations",
                "label": "LOCATION",
                "patterns": ["Númenor", "Middle-earth"]
              }
            ]

with open('locations.json', 'w') as jsonlfile:
  json.dump(locations, jsonlfile)

In addition, one pattern in the `locations.json` file contains a hyphen (-). So, to use Entity Ruler on Tolkien's books, we need to define a custom alphabet. Here, we add just the 2 special characters that our text needs to the basic Latin letters.

In [33]:
alphabet = "abcdefghijklmnopqrstuvwxyz"

with open('custom_alphabet.txt', 'w') as alphabet_file:
    alphabet_file.write(alphabet + "\n")
    alphabet_file.write(alphabet.upper() + "\n")
    alphabet_file.write("ú")
    alphabet_file.write("-")

In [34]:
!cat custom_alphabet.txt

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
ú-

Now, we will build `EntityRulerApproach()` with that alphabet. 

In [35]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("locations.json") \
    .setAlphabetResource("/content/custom_alphabet.txt")

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

Checking the results

In [36]:
result.select("entity").show(truncate=False)

+------------------------------------------------------------------------------------+
|entity                                                                              |
+------------------------------------------------------------------------------------+
|[{chunk, 24, 30, Númenor, {entity -> LOCATION, sentence -> 0, id -> locations}, []}]|
+------------------------------------------------------------------------------------+



In [37]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+-------+--------+
|keyword|   label|
+-------+--------+
|Númenor|LOCATION|
+-------+--------+



As seen above, we successfully matched "Númenor" to LOCATION by defining an alphabet. <br/>

If you don't define the required alphabet, you will get this error:

```python
Py4JJavaError: An error occurred while calling o69.fit.
: java.lang.UnsupportedOperationException: Char ú not found on alphabet. Please check alphabet
```

So, the alphabet must have all the characters that can be found in your document.
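
A quick way to find out which characters still need to be added is to compare the characters used in the patterns and texts with the alphabet at hand. The helper below is a hypothetical sketch, not a Spark NLP function. `base_alphabet` mirrors the letters written to `custom_alphabet.txt` above rather than the library's built-in English alphabet, and skipping whitespace is an assumption based on that file containing no space.

```python
import string


def missing_characters(strings, alphabet):
    """Return the sorted characters used in `strings` but absent from `alphabet`.

    Whitespace is skipped: an assumption based on the custom alphabet above,
    which contains no space even though the text does.
    """
    used = {ch for s in strings for ch in s if not ch.isspace()}
    return sorted(used - set(alphabet))


base_alphabet = string.ascii_lowercase + string.ascii_uppercase
patterns = ["Númenor", "Middle-earth"]
texts = ["Elendil used to live in Númenor"]

missing = missing_characters(patterns + texts, base_alphabet)
print(missing)
```

For this example the result is the hyphen and `ú`, the two characters appended to the custom alphabet file.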



#### Non-English Languages

`EntityRulerApproach` has some predefined alphabets for the most common languages: English, Spanish, French, and German. For example, if you have documents in Spanish, you just need to set an alphabet like the example below:

In [38]:
data = spark.createDataFrame([["Elendil solía vivir en Númenor"]]).toDF("text")
data.show(truncate=False)

+------------------------------+
|text                          |
+------------------------------+
|Elendil solía vivir en Númenor|
+------------------------------+



We will set the parameter as `setAlphabetResource("spanish")`. 

In [39]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("locations.json") \
    .setAlphabetResource("spanish")

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

Checking the results

In [40]:
result.select("entity").show(truncate=False)

+------------------------------------------------------------------------------------+
|entity                                                                              |
+------------------------------------------------------------------------------------+
|[{chunk, 23, 29, Númenor, {entity -> LOCATION, sentence -> 0, id -> locations}, []}]|
+------------------------------------------------------------------------------------+



In [41]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+-------+--------+
|keyword|   label|
+-------+--------+
|Númenor|LOCATION|
+-------+--------+



As seen above, we successfully matched a word from a Spanish text.  

If your language does not have a predefined alphabet, you will need to define all the characters of your alphabet, as shown in the first example. Keep in mind that an alphabet may require not only letters but also numbers, punctuation marks, and symbol characters.
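
As a sketch of such a broader alphabet, the snippet below builds a file from the constants in Python's `string` module plus any extra characters. The file name `extended_alphabet.txt` and the choice of character groups are assumptions for illustration; whether every ASCII symbol is accepted by the annotator has not been verified here.

```python
import string

# Extra characters specific to the corpus (here, the accented vowel from the example).
extra_characters = "ú"

alphabet_lines = [
    string.ascii_lowercase,
    string.ascii_uppercase,
    string.digits,
    string.punctuation,  # ASCII punctuation and symbols, including "-"
    extra_characters,
]

with open("extended_alphabet.txt", "w", encoding="utf-8") as alphabet_file:
    alphabet_file.write("\n".join(alphabet_lines))

# Read the file back and collect the distinct characters it defines.
with open("extended_alphabet.txt", encoding="utf-8") as alphabet_file:
    alphabet_characters = set(alphabet_file.read()) - {"\n"}

print(sorted(alphabet_characters))
```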

### `setCaseSensitive`

This parameter sets whether matching is case sensitive. 

Sample text in which the entity mentions are written mostly in lowercase:

In [42]:
data = spark.createDataFrame([["Lord eddard stark was the head of house stark. John snow lives in winterfell."]]).toDF("text")
data.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Lord eddard stark was the head of house stark. John snow lives in winterfell.|
+-----------------------------------------------------------------------------+



In [43]:
#Defining the source JSON file and saving
import json

keywords = [
          {
            "label": "PERSON",
            "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
          },
          {
            "label": "PERSON",
            "patterns": ["Eddard", "Eddard Stark"]
          },
          {
            "label": "LOCATION",
            "patterns": ["Winterfell"]
          },
         ]

with open('keywords.json', 'w') as jsonfile:
    json.dump(keywords, jsonfile)

**`setCaseSensitive(True)`**: we expect no matching since there is no case match between the entities in the given sentence and the source JSON file. 

In [44]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("keywords.json") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

In [45]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+-------+-----+
|keyword|label|
+-------+-----+
+-------+-----+



**`setCaseSensitive(False)`**: we expect matching even though there is no case match between the entities in the given sentence and the source JSON file.

In [46]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("keywords.json") \
    .setCaseSensitive(False)

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

In [47]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+------------+--------+
|     keyword|   label|
+------------+--------+
|eddard stark|  PERSON|
|   John snow|  PERSON|
|  winterfell|LOCATION|
+------------+--------+



### `setUseStorage`

If this parameter is left at `False` (the default), the annotator serializes the patterns file data with SparkML parameters when saving the model. <br/>

We recommend keeping the default `setUseStorage(False)`, since our benchmarks show that this configuration is faster than `setUseStorage(True)`.

# `EntityRulerModel`

This annotator is the instantiated model of the `EntityRulerApproach`. Once you have fitted an `EntityRulerApproach()`, you can save the resulting model and load it back with `EntityRulerModel.load()`. <br/>

Let's rebuild one of the earlier examples and save it. 

In [48]:
data = spark.createDataFrame([["Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell."]]).toDF("text")
data.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell.|
+-----------------------------------------------------------------------------+



In [49]:
#Defining the source JSON file and saving
import json

keywords = [
          {
            "label": "PERSON",
            "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
          },
          {
            "label": "PERSON",
            "patterns": ["Eddard", "Eddard Stark"]
          },
          {
            "label": "LOCATION",
            "patterns": ["Winterfell"]
          },
         ]

with open('keywords.json', 'w') as jsonfile:
    json.dump(keywords, jsonfile)

In [50]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("keywords.json") 
    
pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

Saving the fitted model to disk

In [51]:
pipeline_model.stages[2].write().overwrite().save('models/ruler_approach_model')

Loading the saved model and using it with the `EntityRulerModel()` via `load`. 

In [52]:
entity_ruler = EntityRulerModel.load('models/ruler_approach_model') \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") 

pipeline = Pipeline(stages=[documenter, 
                            sentenceDetector, 
                            entity_ruler])

pipeline_model = pipeline.fit(data)
result= pipeline_model.transform(data)

Checking the result

In [53]:
result.select(F.explode(F.arrays_zip('entity.result', 'entity.metadata')).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+------------+--------+
|     keyword|   label|
+------------+--------+
|Eddard Stark|  PERSON|
|   John Snow|  PERSON|
|  Winterfell|LOCATION|
+------------+--------+



As seen above, we built an `EntityRuler`, saved it and used the saved model with `EntityRulerModel`. 