![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/entity-ruler/EntityRuler.ipynb)

In [None]:
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

In [5]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import SparkSession

In [None]:
spark = sparknlp.start()

This notebook uses the default configuration (`useStorage=true`). This parameter tells the annotator to serialize the patterns file data with RocksDB storage when saving the model.

In [7]:
data = spark.createDataFrame([["Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell."]]).toDF("text")

In [8]:
data.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Lord Eddard Stark was the head of House Stark. John Snow lives in Winterfell.|
+-----------------------------------------------------------------------------+



# Keywords Patterns

EntityRuler no longer needs the `Tokenizer` or `RegexTokenizer` annotators when using keyword patterns (non-regex patterns). It produces the chunk output directly from the defined patterns, as shown in the example below.

In [9]:
import json

keywords = [
          {
            "label": "PERSON",
            "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
          },
          {
            "label": "PERSON",
            "patterns": ["Eddard", "Eddard Stark"]
          },
          {
            "label": "LOCATION",
            "patterns": ["Winterfell"]
          },
         ]

with open('./keywords.json', 'w') as jsonfile:
    json.dump(keywords, jsonfile)

We are going to use a JSON file with the following format:

In [10]:
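! cat ./keywords.json

[{"label": "PERSON", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]}, {"label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]}, {"label": "LOCATION", "patterns": ["Winterfell"]}]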
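
Each entry in the patterns file pairs a `label` with a list of `patterns`. As a minimal sketch, the check below looks for malformed entries before the file is written. `validate_entries` is a hypothetical helper, not part of Spark NLP, and the rules it enforces (non-empty label, non-empty list of non-empty strings) are assumptions drawn from the entries shown above.

```python
# Hypothetical pre-flight check for EntityRuler pattern entries (not a Spark NLP API).
keywords = [
    {"label": "PERSON", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]},
    {"label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]},
    {"label": "LOCATION", "patterns": ["Winterfell"]},
]


def validate_entries(entries):
    """Return a list of human-readable problems found in the pattern entries."""
    problems = []
    for index, entry in enumerate(entries):
        label = entry.get("label")
        if not isinstance(label, str) or not label:
            problems.append(f"entry {index}: 'label' must be a non-empty string")
        patterns = entry.get("patterns")
        if not isinstance(patterns, list) or not patterns:
            problems.append(f"entry {index}: 'patterns' must be a non-empty list")
        elif not all(isinstance(p, str) and p for p in patterns):
            problems.append(f"entry {index}: every pattern must be a non-empty string")
    return problems


problems = validate_entries(keywords)
bad_problems = validate_entries([{"label": "PERSON"}])
```

An empty `problems` list means the entries follow the assumed shape; `bad_problems` reports the entry that lacks a `patterns` list.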


When working with keywords, we don't need a Tokenizer in the pipeline.

In [11]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols("document").setOutputCol("sentence")

entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./keywords.json") \
    .setUseStorage(True)

In [12]:
pipeline = Pipeline(stages=[document_assembler, sentence_detector, entity_ruler])
pipeline_model = pipeline.fit(data)

In [13]:
pipeline_model.transform(data).select("entity").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 5, 16, Eddard Stark, {entity -> PERSON, sentence -> 0}, []}, {chunk, 47, 55, John Snow, {entity -> PERSON, sentence -> 1}, []}, {chunk, 66, 75, Winterfell, {entity -> LOCATION, sentence -> 1}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [14]:
light_pipeline = LightPipeline(pipeline_model)

In [15]:
annotations = light_pipeline.fullAnnotate("Doctor John Snow lives in London, whereas Lord Commander Jon Snow lives in Castle Black")[0]
annotations.keys()

dict_keys(['document', 'sentence', 'entity'])

In [16]:
annotations.get('entity')

[Annotation(chunk, 7, 15, John Snow, {'entity': 'PERSON', 'sentence': '0'}),
 Annotation(chunk, 57, 64, Jon Snow, {'entity': 'PERSON', 'sentence': '0'})]
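
The result above keeps `John Snow` and `Jon Snow` rather than the shorter `John` and `Jon`, even though all four are listed as patterns. One way to picture that behaviour is a longest-match scan. The sketch below is plain Python built on `re`; it is not Spark NLP's implementation. `longest_keyword_matches` is a hypothetical helper, and the word-boundary handling is an assumption.

```python
import re

# Same keyword entries as in the notebook.
keywords = [
    {"label": "PERSON", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]},
    {"label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]},
    {"label": "LOCATION", "patterns": ["Winterfell"]},
]


def longest_keyword_matches(text, entries):
    """Return (begin, end, chunk, label) tuples, preferring the longest keyword.

    `end` is inclusive, mirroring the offsets printed by the notebook.
    """
    keyword_to_label = {kw: entry["label"] for entry in entries for kw in entry["patterns"]}
    # Longest keywords come first so the alternation tries them before their prefixes.
    ordered = sorted(keyword_to_label, key=len, reverse=True)
    regex = re.compile(r"\b(?:" + "|".join(re.escape(kw) for kw in ordered) + r")\b")
    return [
        (m.start(), m.end() - 1, m.group(0), keyword_to_label[m.group(0)])
        for m in regex.finditer(text)
    ]


light_text = "Doctor John Snow lives in London, whereas Lord Commander Jon Snow lives in Castle Black"
light_matches = longest_keyword_matches(light_text, keywords)
```

For this sentence, `light_matches` holds the spans 7 to 15 and 57 to 64, the same offsets the `LightPipeline` reports.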

We can also define an `id` field to identify entities. The patterns file can be written in JSON Lines format as well, as in the example below.

In [17]:
keywords = [
            {
              "id": "names-with-j",
              "label": "PERSON",
              "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
            },
            {
              "id": "names-with-e",
              "label": "PERSON",
              "patterns": ["Eddard", "Eddard Stark"]
            },
            {
              "id": "locations",
              "label": "LOCATION",
              "patterns": ["Winterfell"]
            },
         ]

with open('./keywords.jsonl', 'w') as jsonlfile:
    for keyword in keywords:
      json.dump(keyword, jsonlfile)
      jsonlfile.write('\n')
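
In JSON Lines, every line is a complete JSON object, so a file can be checked by parsing it line by line. The read-back sketch below uses only the standard library; `read_jsonl` is a hypothetical helper, and it writes to a temporary directory so that it does not touch the notebook's own files.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


sample_entries = [
    {"id": "names-with-e", "label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]},
    {"id": "locations", "label": "LOCATION", "patterns": ["Winterfell"]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "keywords.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        for entry in sample_entries:
            handle.write(json.dumps(entry) + "\n")
    loaded_entries = read_jsonl(sample_path)

# Group the ids by label to confirm each line carried its own id.
ids_by_label = {}
for entry in loaded_entries:
    ids_by_label.setdefault(entry["label"], []).append(entry["id"])
```

If any line is not valid JSON on its own, `json.loads` raises `json.JSONDecodeError`, which flags a malformed file before it reaches the annotator.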

In [18]:
! cat ./keywords.jsonl

{"id": "names-with-j", "label": "PERSON", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]}
{"id": "names-with-e", "label": "PERSON", "patterns": ["Eddard", "Eddard Stark"]}
{"id": "locations", "label": "LOCATION", "patterns": ["Winterfell"]}


In [19]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./keywords.jsonl", ReadAs.TEXT, options={"format": "JSONL"}) \
    .setUseStorage(True)

In [20]:
pipeline = Pipeline(stages=[document_assembler, sentence_detector, entity_ruler])
model = pipeline.fit(data)
model.transform(data).select("entity").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                                                                                                                 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 5, 16, Eddard Stark, {entity -> PERSON, sentence -> 0, id -> names-with-e}, []}, {chunk, 47, 55, John Snow, {entity -> PERSON, sentence -> 1, id -> names-with-j}, []}, {chunk, 66, 75, Winterfe

For the CSV file we use the following configuration:


In [21]:
with open('./keywords.csv', 'w') as csvfile:
    csvfile.write('PERSON|Jon\n')
    csvfile.write('PERSON|John\n')
    csvfile.write('PERSON|John Snow\n')
    csvfile.write('LOCATION|Winterfell')

In [22]:
! cat ./keywords.csv

PERSON|Jon
PERSON|John
PERSON|John Snow
LOCATION|Winterfell
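
Each CSV line holds a label and a single keyword separated by `|`. As a rough sketch of how those rows relate to the JSON entries used earlier, the snippet below groups the keywords by label with Python's `csv` module. `group_csv_patterns` is a hypothetical helper and says nothing about how Spark NLP parses the file internally.

```python
import csv
import io

# Same content as keywords.csv in the notebook.
csv_text = "PERSON|Jon\nPERSON|John\nPERSON|John Snow\nLOCATION|Winterfell"


def group_csv_patterns(text, delimiter="|"):
    """Group `label|keyword` rows into entries shaped like the JSON patterns file."""
    grouped = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        label, keyword = row
        grouped.setdefault(label, []).append(keyword)
    return [{"label": label, "patterns": patterns} for label, patterns in grouped.items()]


csv_entries = group_csv_patterns(csv_text)
```

Python's `csv` module takes the literal `|` character as its delimiter, whereas the notebook passes the escaped form `"\\|"` in the `options` of `setPatternsResource`.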

In [23]:
entity_ruler_csv = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./keywords.csv", options={"format": "csv", "delimiter": "\\|"}) \
    .setUseStorage(True)

In [24]:
pipeline_csv = Pipeline(stages=[document_assembler, sentence_detector, entity_ruler_csv])
model_csv = pipeline_csv.fit(data)

In [25]:
model_csv.transform(data).select("entity").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 47, 55, John Snow, {entity -> PERSON, sentence -> 1}, []}, {chunk, 66, 75, Winterfell, {entity -> LOCATION, sentence -> 1}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------------+



# Regex Patterns

Starting with Spark NLP 4.2.0, regex patterns must be defined at a more granular level: each label entry carries its own `regex` field. For example, we can use the JSON file below.

In [26]:
data = spark.createDataFrame([["The address is 123456 in Winterfell"]]).toDF("text")

In [27]:
patterns_string = """
[
  {
    "id": "id-regex",
    "label": "ID",
    "patterns": ["[0-9]+"],
    "regex": true
  },
  {
    "id": "locations-words",
    "label": "LOCATION",
    "patterns": ["Winterfell"],
    "regex": false
  }
]
"""
patterns_obj = json.loads(patterns_string)
with open('./patterns.json', 'w') as jsonfile:
    json.dump(patterns_obj, jsonfile)

In [28]:
!cat ./patterns.json

[{"id": "id-regex", "label": "ID", "patterns": ["[0-9]+"], "regex": true}, {"id": "locations-words", "label": "LOCATION", "patterns": ["Winterfell"], "regex": false}]

When defining a regex pattern, we need to include a Tokenizer annotator in the pipeline.

In [29]:
tokenizer = Tokenizer().setInputCols("sentence").setOutputCol("token")

In [30]:
regex_entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./patterns.json") \
    .setUseStorage(True)

In [31]:
regex_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, regex_entity_ruler])
regex_model = regex_pipeline.fit(data)

In [32]:
regex_model.transform(data).select("entity").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 15, 20, 123456, {entity -> ID, id -> id-regex, sentence -> 0}, []}, {chunk, 25, 34, Winterfell, {entity -> LOCATION, sentence -> 0, id -> locations-words}, []}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
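
To illustrate why tokens matter for the regex entry, the sketch below applies the same two entries to whitespace-separated tokens in plain Python. It assumes that a regex pattern is tested against a whole token and that a keyword is compared literally; both are simplifications for illustration, not a description of Spark NLP's internals. `match_tokens` is a hypothetical helper.

```python
import re

# Same entries as patterns.json in the notebook.
patterns = [
    {"id": "id-regex", "label": "ID", "patterns": ["[0-9]+"], "regex": True},
    {"id": "locations-words", "label": "LOCATION", "patterns": ["Winterfell"], "regex": False},
]


def match_tokens(text, entries):
    """Return (begin, end, chunk, label, id) for every token matched by an entry.

    Tokens are runs of non-whitespace characters; `end` is inclusive.
    """
    matches = []
    for token in re.finditer(r"\S+", text):
        word = token.group(0)
        for entry in entries:
            for pattern in entry["patterns"]:
                if entry["regex"]:
                    hit = re.fullmatch(pattern, word) is not None
                else:
                    hit = word == pattern
                if hit:
                    matches.append((token.start(), token.end() - 1, word, entry["label"], entry["id"]))
    return matches


regex_matches = match_tokens("The address is 123456 in Winterfell", patterns)
```

Under these assumptions, `regex_matches` contains `123456` at offsets 15 to 20 and `Winterfell` at offsets 25 to 34, the same spans shown in the pipeline output.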

