![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/04.5.EntityRuler.ipynb)

# Rule-based Entity Recognition with EntityRuler

This notebook covers the parameters and usage of **EntityRuler**. Spark NLP provides two annotators for this task: `EntityRulerApproach` and `EntityRulerModel`.

- `EntityRulerApproach` fits an annotator that matches exact strings or regex patterns, provided in a file, against a document and assigns each match a named entity label. The definitions can contain any number of named entities.
- `EntityRulerModel` is the instantiated (fitted) model produced by `EntityRulerApproach`.

## Install Spark NLP

In [1]:
!pip install -q pyspark==3.4.1 spark-nlp==5.3.2

In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F
from pyspark.sql import SparkSession

spark = sparknlp.start()
spark

In [4]:
data = spark.createDataFrame([["Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell."]]).toDF("text")

data.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell.|
+-----------------------------------------------------------------------------+



# `EntityRulerApproach`

## Keywords Patterns

EntityRuler outputs a chunk for every span of text that matches one of the defined patterns, as shown in the example below.

In [5]:
import json

keywords = [
  {
    "label": "PERSON",
    "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
  },
  {
    "label": "PERSON",
    "patterns": ["Eddard", "Eddard Stark"]
  },
  {
    "label": "LOCATION",
    "patterns": ["Winterfell"]
  },
]

with open('./keywords.json', 'w') as jsonfile:
    json.dump(keywords, jsonfile)

We are going to use a JSON file with the following format:

In [31]:
! cat ./keywords.json

[{"label": "PERSON", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]}, {"label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]}, {"label": "LOCATION", "patterns": ["Winterfell"]}]
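Before handing a patterns file to `setPatternsResource`, it can help to check that every entry has the expected shape. The following is a minimal pure-Python sketch. `validate_rules` is a hypothetical helper written for illustration, not part of Spark NLP, and it only checks the two fields used in the example above (`label` and `patterns`).

```python
from typing import Any, Dict, List


def validate_rules(rules: List[Dict[str, Any]]) -> List[str]:
    """Return a list of readable problems; an empty list means the rules look fine."""
    problems = []
    for index, rule in enumerate(rules):
        label = rule.get("label")
        if not isinstance(label, str) or not label:
            problems.append(f"entry {index}: 'label' must be a non-empty string")
        patterns = rule.get("patterns")
        if not isinstance(patterns, list) or not patterns:
            problems.append(f"entry {index}: 'patterns' must be a non-empty list")
        elif not all(isinstance(p, str) and p for p in patterns):
            problems.append(f"entry {index}: every pattern must be a non-empty string")
    return problems


# Same rules as in the cell above.
keywords = [
    {"label": "PERSON", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]},
    {"label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]},
    {"label": "LOCATION", "patterns": ["Winterfell"]},
]

print(validate_rules(keywords))  # []

# A deliberately broken entry produces one message per problem.
broken = [{"label": "", "patterns": []}]
for problem in validate_rules(broken):
    print(problem)
```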

When working with keywords only, we DON'T need a `Tokenizer` in the pipeline:

In [7]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./keywords.json") \
    .setUseStorage(True)

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        entity_ruler
])

result = pipeline.fit(data).transform(data)

In [8]:
result.select("entity").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 5, 16, Eddard Stark, {entity -> PERSON, sentence -> 0}, []}, {chunk, 47, 55, John Snow, {entity -> PERSON, sentence -> 1}, []}, {chunk, 66, 75, Winterfell, {entity -> LOCATION, sentence -> 1}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------
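Note that the output contains `Eddard Stark` and `John Snow`, but not the shorter patterns `Eddard` and `John` that also occur in the text: the longer match is kept. The sketch below reproduces that result in plain Python. It is only an illustration that assumes a longest-match-wins rule with word boundaries; it is not how Spark NLP implements the annotator, and `match_keywords` is a hypothetical name. Offsets are counted over the whole document with an inclusive end index, as in the output above.

```python
import re
from typing import Dict, List, Tuple

Match = Tuple[int, int, str, str]  # (begin, inclusive end, matched text, label)


def match_keywords(text: str, rules: List[Dict]) -> List[Match]:
    # Collect every occurrence of every pattern, respecting word boundaries.
    candidates: List[Match] = []
    for rule in rules:
        for pattern in rule["patterns"]:
            for m in re.finditer(r"\b" + re.escape(pattern) + r"\b", text):
                candidates.append((m.start(), m.end() - 1, m.group(), rule["label"]))

    # Prefer longer matches; ties are broken by position in the text.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))

    chosen: List[Match] = []
    for cand in candidates:
        # Keep a candidate only if it does not overlap an already chosen match.
        if all(cand[1] < kept[0] or cand[0] > kept[1] for kept in chosen):
            chosen.append(cand)
    return sorted(chosen)


rules = [
    {"label": "PERSON", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]},
    {"label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]},
    {"label": "LOCATION", "patterns": ["Winterfell"]},
]
text = "Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell."

for match in match_keywords(text, rules):
    print(match)
# (5, 16, 'Eddard Stark', 'PERSON')
# (47, 55, 'John Snow', 'PERSON')
# (66, 75, 'Winterfell', 'LOCATION')
```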

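We can also define an `id` field for each entry; the id then appears in the metadata of every matching chunk. In addition, the patterns file can be written in JSON Lines format, as in the example below.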

In [9]:
keywords = [
    {
        "id": "names-with-j",
        "label": "PERSON",
        "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
    },
    {
        "id": "names-with-e",
        "label": "PERSON",
        "patterns": ["Eddard", "Eddard Stark"]
    },
    {
        "id": "locations",
        "label": "LOCATION",
        "patterns": ["Winterfell"]
    },
]

with open('./keywords.jsonl', 'w') as jsonlfile:
    for keyword in keywords:
        json.dump(keyword, jsonlfile)
        jsonlfile.write('\n')

In [10]:
! cat ./keywords.jsonl

{"id": "names-with-j", "label": "PERSON", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]}
{"id": "names-with-e", "label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]}
{"id": "locations", "label": "LOCATION", "patterns": ["Winterfell"]}
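JSON Lines stores one complete JSON object per line instead of a single top-level array, so the file can be read back line by line. The following round trip is a small sketch in plain Python; it writes to a temporary directory (an assumption made so the example does not touch `./keywords.jsonl`) and groups the ids by label.

```python
import json
import os
import tempfile

rules = [
    {"id": "names-with-j", "label": "PERSON", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]},
    {"id": "names-with-e", "label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]},
    {"id": "locations", "label": "LOCATION", "patterns": ["Winterfell"]},
]

# Write one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "keywords.jsonl")
with open(path, "w") as jsonlfile:
    for rule in rules:
        jsonlfile.write(json.dumps(rule) + "\n")

# Read the file back, parsing each non-empty line on its own.
with open(path) as jsonlfile:
    loaded = [json.loads(line) for line in jsonlfile if line.strip()]

# Group the rule ids by label to get a quick overview of the file.
ids_by_label = {}
for rule in loaded:
    ids_by_label.setdefault(rule["label"], []).append(rule["id"])

print(ids_by_label)
# {'PERSON': ['names-with-j', 'names-with-e'], 'LOCATION': ['locations']}
```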


In [11]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./keywords.jsonl", ReadAs.TEXT, options={"format": "JSONL"}) \
    .setUseStorage(True)

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        entity_ruler
])

In [12]:
result = pipeline.fit(data).transform(data)

result.select("entity").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                                                                                                                 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 5, 16, Eddard Stark, {entity -> PERSON, sentence -> 0, id -> names-with-e}, []}, {chunk, 47, 55, John Snow, {entity -> PERSON, sentence -> 1, id -> names-with-j}, []}, {chunk, 66, 75, Winterfe

For the CSV file we use the following configuration:


In [13]:
with open('./keywords.csv', 'w') as csvfile:
    csvfile.write('PERSON|Jon\n')
    csvfile.write('PERSON|John\n')
    csvfile.write('PERSON|John Snow\n')
    csvfile.write('LOCATION|Winterfell')

In [14]:
! cat ./keywords.csv

PERSON|Jon
PERSON|John
PERSON|John Snow
LOCATION|Winterfell
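Each line of this file holds a label and a single pattern separated by `|`. The sketch below parses the same content in plain Python and groups it into the label/patterns structure used by the JSON examples. Because `|` is a regex metacharacter, the sketch escapes it before splitting, which is presumably also why the delimiter is written in escaped form in the Spark NLP configuration. The parsing code is an illustration, not Spark NLP's reader.

```python
import re
from collections import defaultdict

csv_text = "PERSON|Jon\nPERSON|John\nPERSON|John Snow\nLOCATION|Winterfell"

patterns_by_label = defaultdict(list)
for line in csv_text.splitlines():
    if not line.strip():
        continue  # ignore blank lines
    # Split on the first pipe only, so a pattern could itself contain "|".
    label, pattern = re.split(r"\|", line, maxsplit=1)
    patterns_by_label[label].append(pattern)

# Same shape as the JSON rules: one entry per label.
rules = [
    {"label": label, "patterns": patterns}
    for label, patterns in patterns_by_label.items()
]
print(rules)
# [{'label': 'PERSON', 'patterns': ['Jon', 'John', 'John Snow']},
#  {'label': 'LOCATION', 'patterns': ['Winterfell']}]
```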

In [15]:
entity_ruler_csv = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./keywords.csv", options={"format": "csv", "delimiter": "\\|"}) \
    .setUseStorage(True)

pipeline_csv = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        entity_ruler_csv
])

In [16]:
result_csv = pipeline_csv.fit(data).transform(data)

result_csv.select("entity").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 47, 55, John Snow, {entity -> PERSON, sentence -> 1}, []}, {chunk, 66, 75, Winterfell, {entity -> LOCATION, sentence -> 1}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------------+



## Regex Patterns

Starting with Spark NLP 4.2.0, regex patterns must be declared at a more granular level: each label entry carries its own `regex` flag. For example, we can use the JSON file below.

In [17]:
data = spark.createDataFrame([["The address is 123456 in Winterfell"]]).toDF("text")

In [18]:
patterns_string = """
[
  {
    "id": "id-regex",
    "label": "ID",
    "patterns": ["[0-9]+"],
    "regex": true
  },
  {
    "id": "locations-words",
    "label": "LOCATION",
    "patterns": ["Winterfell"],
    "regex": false
  }
]
"""
patterns_obj = json.loads(patterns_string)
with open('./patterns.json', 'w') as jsonfile:
    json.dump(patterns_obj, jsonfile)

In [19]:
!cat ./patterns.json

[{"id": "id-regex", "label": "ID", "patterns": ["[0-9]+"], "regex": true}, {"id": "locations-words", "label": "LOCATION", "patterns": ["Winterfell"], "regex": false}]
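To make the role of the `regex` flag concrete, here is a plain-Python sketch that treats entries with `"regex": true` as regular expressions and all other entries as literal strings. How Spark NLP applies the patterns internally (for example per token or per sentence) is not covered here; the sketch simply searches the raw text, and `find_entities` is a hypothetical name.

```python
import re

text = "The address is 123456 in Winterfell"
rules = [
    {"id": "id-regex", "label": "ID", "patterns": ["[0-9]+"], "regex": True},
    {"id": "locations-words", "label": "LOCATION", "patterns": ["Winterfell"], "regex": False},
]


def find_entities(text, rules):
    found = []
    for rule in rules:
        for pattern in rule["patterns"]:
            # Literal keywords are escaped so characters such as "." or "+"
            # are not interpreted as regex syntax.
            source = pattern if rule.get("regex", False) else re.escape(pattern)
            for m in re.finditer(source, text):
                # End offset is inclusive.
                found.append((m.start(), m.end() - 1, m.group(), rule["label"], rule["id"]))
    return sorted(found)


print(find_entities(text, rules))
# [(15, 20, '123456', 'ID', 'id-regex'), (25, 34, 'Winterfell', 'LOCATION', 'locations-words')]
```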

When defining regex patterns, we need to add a `Tokenizer` annotator to the pipeline and pass its output column to the `EntityRulerApproach`:

In [20]:
tokenizer = Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

regex_entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./patterns.json") \
    .setUseStorage(True)

regex_pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        regex_entity_ruler
])

In [21]:
regex_result = regex_pipeline.fit(data).transform(data)

regex_result.select("entity").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 15, 20, 123456, {entity -> ID, id -> id-regex, sentence -> 0}, []}, {chunk, 25, 34, Winterfell, {entity -> LOCATION, sentence -> 0, id -> locations-words}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



# `EntityRulerModel`

This annotator is the instantiated model of `EntityRulerApproach`. Once you fit an `EntityRulerApproach()`, you can save the resulting model and load it back with `EntityRulerModel.load()`.

Let's rebuild one of the earlier examples and save it.

In [22]:
data = spark.createDataFrame([["Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell."]]).toDF("text")
data.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell.|
+-----------------------------------------------------------------------------+



In [23]:
# Define the source patterns and save them to a JSON file
import json

keywords = [
    {
        "label": "PERSON",
        "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
    },
    {
        "label": "PERSON",
        "patterns": ["Eddard", "Eddard Stark"]
    },
    {
        "label": "LOCATION",
        "patterns": ["Winterfell"]
    },
]

with open('keywords.json', 'w') as jsonfile:
    json.dump(keywords, jsonfile)

In [24]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("keywords.json")

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        entity_ruler
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = pipeline.fit(empty_data)

Saving the fitted model (the last stage of the fitted pipeline) to disk:

In [25]:
pipeline_model.stages[-1].write().overwrite().save('models/ruler_approach_model')

Loading the saved model with `EntityRulerModel.load()`:

In [26]:
# Load from the same relative path that was used when saving
entity_ruler = EntityRulerModel.load('models/ruler_approach_model') \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity")

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        entity_ruler
])

result = pipeline.fit(data).transform(data)

Checking the result

In [27]:
result.select(F.explode(F.arrays_zip(result.entity.result,
                                     result.entity.metadata)).alias('col'))\
      .select(F.expr("col['0']").alias("keyword"),
              F.expr("col['1']['entity']").alias("label")).show()

+------------+--------+
|     keyword|   label|
+------------+--------+
|Eddard Stark|  PERSON|
|   John Snow|  PERSON|
|  Winterfell|LOCATION|
+------------+--------+
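The `arrays_zip` plus `explode` combination pairs the i-th chunk text with the i-th metadata map and then emits one row per pair. In plain Python the same reshaping looks roughly like this; the two lists are copied by hand from the entity output shown earlier in this notebook rather than read from Spark.

```python
# Parallel arrays, as found in entity.result and entity.metadata.
results = ["Eddard Stark", "John Snow", "Winterfell"]
metadata = [
    {"entity": "PERSON", "sentence": "0"},
    {"entity": "PERSON", "sentence": "1"},
    {"entity": "LOCATION", "sentence": "1"},
]

# zip pairs the arrays element by element; the comprehension yields one row per pair.
rows = [(keyword, meta["entity"]) for keyword, meta in zip(results, metadata)]

for keyword, label in rows:
    print(f"{keyword:>12} | {label}")
```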



As seen above, we fit an `EntityRulerApproach`, saved the resulting model, and then loaded and used it through `EntityRulerModel`.

### Using LightPipeline

The EntityRuler annotator can also be applied through a `LightPipeline`:

In [28]:
light_pipeline = LightPipeline(pipeline_model)

In [29]:
annotations = light_pipeline.fullAnnotate("Doctor John Snow lives in London, whereas Lord Commander Jon Snow lives in Castle Black")[0]
annotations.keys()

dict_keys(['document', 'sentence', 'entity'])

In [30]:
annotations.get('entity')

[Annotation(chunk, 7, 15, John Snow, {'entity': 'PERSON', 'sentence': '0'}, []),
 Annotation(chunk, 57, 64, Jon Snow, {'entity': 'PERSON', 'sentence': '0'}, [])]
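Only the two person names are returned, because `London` and `Castle Black` are not among the defined patterns. To work with these results as plain tuples, the annotations can be flattened as in the sketch below. It assumes each annotation exposes `begin`, `end`, `result` and `metadata` attributes and uses a stand-in class (`FakeAnnotation`, a hypothetical name) populated with the values printed above, so it runs without Spark.

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation so the sketch runs without a Spark session.
FakeAnnotation = namedtuple("FakeAnnotation", ["begin", "end", "result", "metadata"])

entities = [
    FakeAnnotation(7, 15, "John Snow", {"entity": "PERSON", "sentence": "0"}),
    FakeAnnotation(57, 64, "Jon Snow", {"entity": "PERSON", "sentence": "0"}),
]


def to_rows(annotations):
    """Flatten annotations into (text, label, begin, end) tuples."""
    return [(a.result, a.metadata["entity"], a.begin, a.end) for a in annotations]


for row in to_rows(entities):
    print(row)
# ('John Snow', 'PERSON', 7, 15)
# ('Jon Snow', 'PERSON', 57, 64)
```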