![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/english/entity-ruler/EntityRuler_Alphabet.ipynb)

# Defining EntityRuler with an Alphabet

In [None]:
# Only run this Cell when you are using Spark NLP on Google Colab
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import SparkSession

In [None]:
spark = sparknlp.start()

Since Spark NLP 4.3.1, the latency of Entity Ruler has been significantly reduced by implementing the Aho-Corasick algorithm. This requires defining an alphabet in some cases. For English documents you don't need to define one, because the Entity Ruler annotator uses an English alphabet by default. For other use cases, proceed as in the example below:

In [None]:
data = spark.createDataFrame([["Elendil used to live in Númenor"]]).toDF("text")
data.show(truncate=False)

+-------------------------------+
|text                           |
+-------------------------------+
|Elendil used to live in Númenor|
+-------------------------------+



The text above contains a special character: the accented vowel ú.

In [None]:
import json

locations = [
              {
                "id": "locations",
                "label": "LOCATION",
                "patterns": ["Númenor", "Middle-earth"]
              }
            ]

with open('./locations.json', 'w') as json_file:
    json.dump(locations, json_file)

In addition, one pattern in the `locations.json` file contains a hyphen (-).
So we need to define a custom alphabet to use Entity Ruler on Tolkien's books. Here, we add just the two special characters that our text and patterns need (ú and -) on top of the English letters.

In [None]:
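# Start from the English letters; each case goes on its own line
alphabet = "abcdefghijklmnopqrstuvwxyz"

# Write as UTF-8 so the accented character is stored the same way on any platform
with open('./custom_alphabet.txt', 'w', encoding='utf-8') as alphabet_file:
    alphabet_file.write(alphabet + "\n")
    alphabet_file.write(alphabet.upper() + "\n")
    # Extra characters required by the patterns: accented ú and hyphen
    alphabet_file.write("ú")
    alphabet_file.write("-")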

In [None]:
!cat custom_alphabet.txt

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
ú-

In [None]:
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols("document").setOutputCol("sentence")

entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./locations.json") \
    .setAlphabetResource("./custom_alphabet.txt")

In [None]:
pipeline = Pipeline(stages=[document_assembler, sentence_detector, entity_ruler])
model = pipeline.fit(data)

In [None]:
model.transform(data).select("entity").show(truncate=False)

+------------------------------------------------------------------------------------+
|entity                                                                              |
+------------------------------------------------------------------------------------+
|[{chunk, 24, 30, Númenor, {entity -> LOCATION, sentence -> 0, id -> locations}, []}]|
+------------------------------------------------------------------------------------+



If you don't define the required alphabet, fitting the pipeline fails with an error like this:

```
Py4JJavaError: An error occurred while calling o69.fit.
: java.lang.UnsupportedOperationException: Char ú not found on alphabet. Please check alphabet
```
So, the alphabet must have **all the characters** that can be found in your document.
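
Before fitting the pipeline, it can help to check which characters are missing from the alphabet. The sketch below is a plain-Python helper, not part of Spark NLP: `missing_characters` is a hypothetical name, and it only compares the pattern strings against the alphabet. Whether other characters in your documents (for example whitespace) also need to be listed is not something this check decides.

```python
from typing import Iterable, Set


def missing_characters(alphabet: str, patterns: Iterable[str]) -> Set[str]:
    """Return the characters used in `patterns` that are absent from `alphabet`.

    Hypothetical helper for illustration; it is not a Spark NLP API.
    """
    # Line breaks only separate lines in the alphabet file, so ignore them
    allowed = set(alphabet) - {"\n", "\r"}
    used = set("".join(patterns))
    return used - allowed


english = "abcdefghijklmnopqrstuvwxyz" + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
patterns = ["Númenor", "Middle-earth"]

# With only English letters, the accent and the hyphen are reported
print(sorted(missing_characters(english, patterns)))          # ['-', 'ú']

# After adding the two extra characters, nothing is missing
print(sorted(missing_characters(english + "ú-", patterns)))   # []
```

An empty result means every character in the patterns is covered by the alphabet you are about to write to the file.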

# Non-English Languages

EntityRuler ships with predefined alphabets for the most common languages: English, Spanish, French, and German. So for documents in Spanish, you only need to set the alphabet by name, as in the example below:

In [None]:
data = spark.createDataFrame([["Elendil solía vivir en Númenor"]]).toDF("text")
data.show(truncate=False)

+------------------------------+
|text                          |
+------------------------------+
|Elendil solía vivir en Númenor|
+------------------------------+



In [None]:
entity_ruler = EntityRulerApproach() \
    .setInputCols(["sentence"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./locations.json") \
    .setAlphabetResource("spanish")

In [None]:
pipeline = Pipeline(stages=[document_assembler, sentence_detector, entity_ruler])
model = pipeline.fit(data)

In [None]:
model.transform(data).select("entity").show(truncate=False)

+------------------------------------------------------------------------------------+
|entity                                                                              |
+------------------------------------------------------------------------------------+
|[{chunk, 23, 29, Númenor, {entity -> LOCATION, sentence -> 0, id -> locations}, []}]|
+------------------------------------------------------------------------------------+



If there is no predefined alphabet for your language, you will need to define all the characters of your alphabet yourself, as shown in the first example.
Keep in mind that an alphabet may require not only letters but also numbers, punctuation marks, and symbol characters.
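
As a rough sketch of such a broader alphabet, the snippet below builds the file contents from Python's standard `string` constants plus any extra characters you pass in. The helper name `build_alphabet`, the output path `./extended_alphabet.txt`, and the one-group-per-line layout (copied from the first example) are assumptions for illustration; check that the resulting file covers what your own patterns need.

```python
import string


def build_alphabet(extra_chars: str = "") -> str:
    """Build alphabet file contents: letters, digits, punctuation, extras.

    Hypothetical helper for illustration; it is not a Spark NLP API.
    """
    lines = [
        string.ascii_lowercase,   # a-z
        string.ascii_uppercase,   # A-Z
        string.digits,            # 0-9
        string.punctuation,       # ASCII punctuation and symbols, including '-'
    ]
    if extra_chars:
        lines.append(extra_chars)  # language-specific characters, e.g. accents
    return "\n".join(lines)


alphabet_text = build_alphabet("áéíóúñ")

# Write as UTF-8 so the accented characters are stored consistently
with open("./extended_alphabet.txt", "w", encoding="utf-8") as alphabet_file:
    alphabet_file.write(alphabet_text)
```

The resulting path can then be passed to `setAlphabetResource`, in the same way as `./custom_alphabet.txt` in the first example.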