![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/15.0.EntityRuler_with_Clinical_NER_Models.ipynb)

# EntityRuler

`EntityRuler` matches exact strings or regex patterns, provided in a file, against a document and assigns each match a named entity label. The pattern definitions can contain any number of named entities.

The extraction resource can be provided in several formats: a JSON, JSONL, or CSV file.

This notebook showcases the `EntityRuler` annotator with the Healthcare library. For detailed usage of `EntityRuler` itself please check [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/e3d3d942a75752d8040f73538c7f8ce5430e80d9/jupyter/training/english/entity-ruler).

**For licensed users, `ContextualParser` is a more capable annotator. See [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Healthcare/1.2.Contextual_Parser_Rule_Based_NER.ipynb) for more information on `ContextualParser`.**





## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.1.0

In [None]:
from google.colab import files
print('Please upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
# Make sure to restart your notebook afterwards for the changes to take effect
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8283.json
🚨 Outdated OCR Secrets in license file. Version=5.0.0 but should be Version=5.0.1
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8283.json
👌 JSL-Home is up to date! 
👌 Everything is already installed, no changes made


In [None]:
from johnsnowlabs import nlp, medical

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8283.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.0, 💊Spark-Healthcare==5.1.0, running on ⚡ PySpark==3.1.2


In [None]:
spark

## Define EntityRuler

Now let's define keyword patterns for entities to use in `EntityRuler`.

In [None]:
import json

person = [
          {
            "label": "Person",
            "patterns": ["Jon", "John", "John Snow", "Jon Snow"]
          },
          {
            "label": "Person",
            "patterns": ["Eddard", "Eddard Stark"]
          },
          {
            "label": "Clinical_Department",
            "patterns": ["St. John Hospital", "St. Jon Hospital" ]
          },
         ]

with open('./keywords.json', 'w') as jsonfile:
    json.dump(person, jsonfile)
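
The JSON file written above is one of the supported formats. As a rough sketch of how the same kind of rules might look as JSONL and CSV, the snippet below writes both variants using only the standard library. The file names, the temporary directory, and the `|` delimiter are assumptions made for illustration, not values taken from this notebook.

```python
import json
import tempfile
from pathlib import Path

# A trimmed-down version of the rules above (illustrative only).
patterns = [
    {"label": "Person", "patterns": ["Jon", "John Snow"]},
    {"label": "Clinical_Department", "patterns": ["St. John Hospital"]},
]

out_dir = Path(tempfile.mkdtemp())

# JSONL: one JSON object per line instead of a single JSON array.
jsonl_path = out_dir / "keywords.jsonl"
jsonl_path.write_text("\n".join(json.dumps(p) for p in patterns) + "\n")

# CSV: one "label|keyword" pair per line (the "|" delimiter is an assumption).
csv_lines = [f"{p['label']}|{kw}" for p in patterns for kw in p["patterns"]]
csv_path = out_dir / "keywords.csv"
csv_path.write_text("\n".join(csv_lines) + "\n")

print(csv_path.read_text())
```

Formats other than JSON are normally declared through the extra arguments of `setPatternsResource`; check the Spark NLP documentation of your version for the exact option names before relying on them.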

In [None]:
import pyspark.sql.functions as F


def get_ner_table(result, column):
    """
    Helper function that returns the NER chunks in `column` as a pandas DataFrame.
    """
    # Index the annotation column directly instead of building code strings for eval()
    annotations = result[column]

    # Zip the parallel annotation arrays and explode them into one row per chunk
    out = result.select(F.explode(F.arrays_zip(annotations["result"],
                                               annotations["begin"],
                                               annotations["end"],
                                               annotations["metadata"])).alias("cols")) \
                .select(F.expr("cols['0']").alias("chunk"),
                        F.expr("cols['3']['entity']").alias("ner_label"),
                        F.expr("cols['3']['sentence']").alias("sentence"),
                        F.expr("cols['1']").alias("begin"),
                        F.expr("cols['2']").alias("end"))

    return out.toPandas()
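
To make the zip-and-explode step in the helper above easier to follow, here is a pure-Python analogue that needs no Spark session. The `annotations` dictionary and the `to_rows` function are hypothetical stand-ins for one annotation column; they only mirror the shape of the data, not Spark's behavior in every detail.

```python
# One annotation column represented as parallel lists,
# mirroring the result / begin / end / metadata fields.
annotations = {
    "result": ["Eddard Stark", "John Snow"],
    "begin": [5, 53],
    "end": [16, 61],
    "metadata": [
        {"entity": "Person", "sentence": "0"},
        {"entity": "Person", "sentence": "1"},
    ],
}


def to_rows(col):
    """Zip the parallel lists and emit one dict per annotation."""
    zipped = zip(col["result"], col["begin"], col["end"], col["metadata"])
    return [
        {
            "chunk": chunk,
            "ner_label": meta["entity"],
            "sentence": meta["sentence"],
            "begin": begin,
            "end": end,
        }
        for chunk, begin, end, meta in zipped
    ]


rows = to_rows(annotations)
for row in rows:
    print(row)
```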


In [None]:
# Define an EntityRuler with EntityRulerApproach
entity_ruler = nlp.EntityRulerApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./keywords.json")\
    .setCaseSensitive(False)

data = spark.createDataFrame([[""]]).toDF("text")

entity_ruler_model = entity_ruler.fit(data)

# Save the EntityRuler model
entity_ruler_model.write().overwrite().save("tmp_entity_ruler_model")

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

# Extracting entities using the saved model with EntityRulerModel
entity_ruler_loaded = nlp.EntityRulerModel.load("tmp_entity_ruler_model")\
    .setInputCols(["sentence"]) \
    .setOutputCol("entity_ruler")


# Build Pipeline
pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        entity_ruler_loaded
])

pipeline_model = pipeline.fit(data)


In [None]:
text="Lord Eddard Stark was the head of St. John Hospital. John Snow lives in Winterfell and is a doctor at St. john Hospital."

data = spark.createDataFrame([[text]]).toDF("text")

In [None]:
result = pipeline_model.transform(data).cache()

result.select("entity_ruler").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entity_ruler                                                                                                                                                                                                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 5, 16, Eddard Stark, {entity -> Person,

In [None]:
get_ner_table(result, "entity_ruler")

| | chunk | ner_label | sentence | begin | end |
|---|---|---|---|---|---|
| 0 | Eddard Stark | Person | 0 | 5 | 16 |
| 1 | St. John Hospital | Clinical_Department | 0 | 34 | 50 |
| 2 | John Snow | Person | 1 | 53 | 61 |
| 3 | St. john Hospital | Clinical_Department | 1 | 102 | 118 |
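
The matches in the table above can be reproduced with a small standard-library sketch, which may help when reasoning about case-insensitive keyword rules. This is not how `EntityRuler` is implemented; it is a plain regex approximation that assumes longer keywords take precedence over shorter ones and that matches respect word boundaries. The names `label_of`, `matcher`, and `match_entities` are hypothetical.

```python
import re

rules = [
    {"label": "Person", "patterns": ["Jon", "John", "John Snow", "Jon Snow"]},
    {"label": "Person", "patterns": ["Eddard", "Eddard Stark"]},
    {"label": "Clinical_Department", "patterns": ["St. John Hospital", "St. Jon Hospital"]},
]

# Lower-cased keyword -> label lookup for case-insensitive matching.
label_of = {kw.lower(): rule["label"] for rule in rules for kw in rule["patterns"]}

# Try longer keywords first so that "John Snow" wins over "John".
keywords = sorted(label_of, key=len, reverse=True)
matcher = re.compile(
    r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b",
    re.IGNORECASE,
)


def match_entities(text):
    """Return (chunk, label, begin, end) tuples with inclusive end offsets."""
    return [
        (m.group(), label_of[m.group().lower()], m.start(), m.end() - 1)
        for m in matcher.finditer(text)
    ]


text = ("Lord Eddard Stark was the head of St. John Hospital. "
        "John Snow lives in Winterfell and is a doctor at St. john Hospital.")
matches = match_entities(text)
for match in matches:
    print(match)
```

Note that the lower-cased "St. john Hospital" is still matched, which corresponds to `setCaseSensitive(False)` in the pipeline.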


## Combining EntityRuler with Pretrained NER Models


Now we will combine pretrained NER models with the `EntityRuler` annotator. NER models sometimes fail to extract certain chunks, or the model may lack an entity label altogether. In such cases, `EntityRuler` (like `ContextualParser`) can be used to extend the NER coverage. In the example below, we add `ID`, `Female`, and `Male` entities, which are not part of the `ner_jsl` NER model.

In [None]:
entities ="""
[
    {
        "id": "person",
        "label": "Female",
        "patterns": ["she", "her", "girl", "woman", "women", "womanish", "womanlike", "womanly", "madam", "madame", "senora", "lady", "miss", "girlfriend", "wife", "bride", "misses", "mrs.", "female"],
        "regex": false
    },
    {
        "id": "person",
        "label": "Male",
        "patterns": ["he", "him", "masculine", "boy", "father", "guy", "macho", "brother", "fellow", "gent", "gentleman", "grandfather", "husband", "sir", "son", "manful", "manlike", "manly"],
        "regex": false
    },
    {
        "id": "id-regex",
        "label": "ID",
        "patterns": ["[0-9]{7}"],
        "regex": true
    }
]"""


patterns_obj = json.loads(entities)

with open('./entities.json', 'w') as jsonfile:
    json.dump(patterns_obj, jsonfile)

When a regex pattern is defined in `EntityRuler`, the pipeline must also include a `Tokenizer` annotator.
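
Before wiring a regex rule into the pipeline, it can be useful to check what it matches on a few tokens. The sketch below tests the `[0-9]{7}` pattern from the `ID` rule with Python's `re` module. Splitting on whitespace is only a crude stand-in for the `Tokenizer`, and matching each token in full is an assumption about how the rule is applied, so treat the result as a sanity check rather than a prediction of the annotator's output.

```python
import re

# Same pattern as the "ID" rule defined above.
id_pattern = re.compile(r"[0-9]{7}")

snippet = "Patient # 5874651 is a 28 year old female , seen on 21-02-2020 ."

# Whitespace splitting is a crude stand-in for a real tokenizer.
tokens = snippet.split()

# Keep only the tokens that the pattern matches in full.
ids = [tok for tok in tokens if id_pattern.fullmatch(tok)]
print(ids)
```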

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained()\
    .setInputCols("document")\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

# Extracting entities by EntityRuler
entity_ruler = nlp.EntityRulerApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("ner_entity_ruler") \
    .setPatternsResource("./entities.json")\
    .setCaseSensitive(False)

# Clinical word embeddings
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Extracting entities by ner_jsl
ner_model = medical.NerModel.pretrained("ner_jsl","en","clinical/models") \
    .setInputCols("sentence","token","embeddings") \
    .setOutputCol("ner_jsl")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_jsl"])\
    .setOutputCol("ner_jsl_chunk")

# ChunkMerger; prioritize EntityRuler entities
merger = medical.ChunkMergeApproach()\
    .setInputCols(["ner_entity_ruler", "ner_jsl_chunk"])\
    .setOutputCol("ner_chunk")

# Build Pipeline
pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        entity_ruler,
        word_embeddings,
        ner_model,
        ner_converter,
        merger
])

data = spark.createDataFrame([[""]]).toDF("text")

pipeline_model = pipeline.fit(data)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
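
The idea of prioritizing one chunk source over another can be sketched in plain Python. The snippet below is not the algorithm used by `ChunkMergeApproach`; it is a simplified illustration that assumes the first list always wins when two chunks overlap. The functions `overlaps` and `merge_chunks` and the dictionary layout are hypothetical.

```python
def overlaps(a, b):
    """True when two chunks share at least one character (inclusive offsets)."""
    return a["begin"] <= b["end"] and b["begin"] <= a["end"]


def merge_chunks(preferred, fallback):
    """Keep every preferred chunk; add fallback chunks that overlap none of them."""
    merged = list(preferred)
    for chunk in fallback:
        if not any(overlaps(chunk, kept) for kept in preferred):
            merged.append(chunk)
    return sorted(merged, key=lambda c: c["begin"])


# Rule-based chunk (preferred) and model chunks (fallback) for the same text.
ruler_chunks = [{"chunk": "female", "label": "Female", "begin": 35, "end": 40}]
ner_chunks = [
    {"chunk": "28 year old", "label": "Age", "begin": 23, "end": 33},
    {"chunk": "female", "label": "Gender", "begin": 35, "end": 40},
]

merged = merge_chunks(ruler_chunks, ner_chunks)
for chunk in merged:
    print(chunk)
```

Under this assumption, the overlapping `Gender` chunk is dropped in favor of the rule-based `Female` chunk, while the non-overlapping `Age` chunk is kept.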


In [None]:
sample_text = """Patient # 5874651 is a 28 year old female with a history of gestational diabetes mellitus diagnosed eight years prior to
presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis
three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index
( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .
She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was
significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding ,
or rigidity . findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l ,
anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin
( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed
as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior
to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL ,
the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL ,
and lipase was 52 U/L .
β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged
and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
This madame was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides
to 1400 mg/dL , within 24 hours .
Twenty days ago.
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
At birth the typical boy is growing slightly faster than the typical girl, but the velocities become equal at about
seven months, and then the girl grows faster until four years.
From then until adolescence no differences in velocity
can be detected. 21-02-2020
21/04/2020
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

In [None]:
result = pipeline_model.transform(data)

### Handling Errors Caused by a Missing Alphabet

**❗Attention:** The code below will fail. Please read the explanation that follows.

In [None]:
get_ner_table(result, "ner_chunk")

The code above fails. Since Spark NLP 4.2.0, `EntityRuler` requires an alphabet to be defined in some cases. The sample text contains the non-standard character `β`, so we need to define a new alphabet that includes the standard characters plus `β`, as in the example below.

For standard English documents you don't need to define an alphabet, because the `EntityRuler` annotator uses an English alphabet by default.
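
Before writing a custom alphabet file, it can help to check which characters in a text fall outside the expected alphabet. The sketch below approximates the default English alphabet with ASCII letters, digits, punctuation, and whitespace. That approximation is an assumption, and the annotator's real default may differ; `chars_outside_alphabet` is a hypothetical helper.

```python
import string

# Assumption: approximate the default alphabet with ASCII letters, digits,
# punctuation, and whitespace. The annotator's real default may differ.
assumed_alphabet = set(
    string.ascii_letters + string.digits + string.punctuation + string.whitespace
)


def chars_outside_alphabet(text, alphabet=assumed_alphabet):
    """Return the sorted, de-duplicated characters of text not in alphabet."""
    return sorted(set(text) - alphabet)


snippet = "β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L"
missing = chars_outside_alphabet(snippet)
print(missing)
```

Any character reported here is a candidate for the `special` string in the custom alphabet.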

In [None]:
# Define a new alphabet

symbols = """:$&(){}[]?/\\!><@=#-;,%_“.|'`"*#^+~€"""
numbers = "0123456789"
englishAlphabet = "abcdefghijklmnopqrstuvwxyz"
special = "β"

chars = symbols + numbers + englishAlphabet + special

with open('./custom_alphabet.txt', 'w') as alphabet_file:
    alphabet_file.write(chars)

In [None]:
entity_ruler_custom_alphabet = nlp.EntityRulerApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("ner_entity_ruler") \
    .setPatternsResource("./entities.json")\
    .setCaseSensitive(False)\
    .setAlphabetResource('./custom_alphabet.txt')

In [None]:
pipeline_custom_alphabet = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        entity_ruler_custom_alphabet,
        word_embeddings,
        ner_model,
        ner_converter,
        merger
])

model_custom_alphabet = pipeline_custom_alphabet.fit(data)

result_custom_alphabet = model_custom_alphabet.transform(data)

# get combined ner entities of EntityRuler and ner_jsl
get_ner_table(result_custom_alphabet, "ner_chunk")

| | chunk | ner_label | sentence | begin | end |
|---|---|---|---|---|---|
| 0 | 28 year old | Age | 0 | 23 | 33 |
| 1 | female | Female | 0 | 35 | 40 |
| 2 | gestational diabetes mellitus | Diabetes | 0 | 60 | 88 |
| 3 | eight years prior | RelativeDate | 0 | 100 | 116 |
| 4 | type two diabetes mellitus | Diabetes | 0 | 149 | 174 |
| ... | ... | ... | ... | ... | ... |
| 119 | girl | Female | 0 | 2334 | 2337 |
| 120 | four years | Age | 15 | 2358 | 2367 |
| 121 | he | Male | 0 | 2376 | 2377 |
| 122 | differences in velocity | Symptom | 16 | 2401 | 2423 |


In [None]:
# Get ner entities of ner_jsl only
get_ner_table(result_custom_alphabet, "ner_jsl_chunk")

| | chunk | ner_label | sentence | begin | end |
|---|---|---|---|---|---|
| 0 | 28 year old | Age | 0 | 23 | 33 |
| 1 | female | Gender | 0 | 35 | 40 |
| 2 | gestational diabetes mellitus | Diabetes | 0 | 60 | 88 |
| 3 | eight years prior | RelativeDate | 0 | 100 | 116 |
| 4 | type two diabetes mellitus | Diabetes | 0 | 149 | 174 |
| ... | ... | ... | ... | ... | ... |
| 105 | at about\nseven months | RelativeDate | 15 | 2298 | 2318 |
| 106 | girl | Gender | 15 | 2334 | 2337 |
| 107 | four years | Age | 15 | 2358 | 2367 |
| 108 | differences in velocity | Symptom | 16 | 2401 | 2423 |


Comparing the two tables above (`ner_chunk` vs. `ner_jsl_chunk`), we can see that the merged output adds the new `ID` entity and replaces the `Gender` entity with the more granular `Male` and `Female` labels.
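
The comparison can also be made programmatically. The sketch below uses a few rows copied from the two tables above and reports which spans were relabeled and which chunks appear only in the merged output. The tuple layout and the `label_changes` helper are hypothetical; in practice the rows would come from the two pandas DataFrames returned by `get_ner_table`.

```python
# (chunk, ner_label, begin, end) rows copied from the two tables above.
merged_rows = [
    ("28 year old", "Age", 23, 33),
    ("female", "Female", 35, 40),
    ("girl", "Female", 2334, 2337),
    ("he", "Male", 2376, 2377),
]
jsl_rows = [
    ("28 year old", "Age", 23, 33),
    ("female", "Gender", 35, 40),
    ("girl", "Gender", 2334, 2337),
]


def label_changes(merged, baseline):
    """Return (relabeled, added) chunks of merged relative to baseline."""
    base = {(begin, end): label for _, label, begin, end in baseline}
    relabeled = [
        (chunk, base[(begin, end)], label)
        for chunk, label, begin, end in merged
        if (begin, end) in base and base[(begin, end)] != label
    ]
    added = [
        (chunk, label)
        for chunk, label, begin, end in merged
        if (begin, end) not in base
    ]
    return relabeled, added


relabeled, added = label_changes(merged_rows, jsl_rows)
print("relabeled:", relabeled)
print("added:", added)
```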