![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/47.Contextual_Entity_Ruler.ipynb)

# 📜ContextualEntityRuler



ContextualEntityRuler is an annotator that updates chunks based on contextual rules.

These rules are defined in the form of dictionaries and can include prefixes, suffixes, and the context within a specified scope window around the chunks.

This annotator modifies detected chunks by replacing their entity labels or content when the patterns and rules match. It is particularly useful for refining entity recognition results according to specific needs.



## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [4]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/6.1.1.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==6.1.3, 💊Spark-Healthcare==6.1.1, running on ⚡ PySpark==3.4.0


In [5]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only



## **🖨️ Input/Output Annotation Types**

- Input: `SENTENCE`, `TOKEN`, `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**

**Parameters**:

- `setCaseSensitive`: Whether matching against the context is case-sensitive. Default is False.  
- `setAllowPunctuationInBetween`: Whether to allow punctuation between prefix/suffix patterns and the entity. Default is True.
- `setMergeOverlapping`: If False, returns both the modified entities and the original entities. Default is True.
- `setDropEmptyChunks`: If True, removes chunks whose content is empty after the rules are applied. Default is False.

  For example, if a chunk like "September" is matched by a prefix pattern "September" and the mode is set to `"exclude"`, the whole chunk text is excluded. The chunk is then dropped if `setDropEmptyChunks` is True; otherwise it is kept unchanged.
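
To make the interaction between `"exclude"` mode and `setDropEmptyChunks` concrete, here is a toy sketch on plain strings. It is not the annotator's implementation: `exclude_pattern` is a hypothetical helper that only mimics the behavior described above, and it ignores scope windows entirely.

```python
from typing import Optional


def exclude_pattern(chunk: str, pattern: str, drop_empty: bool,
                    case_sensitive: bool = False) -> Optional[str]:
    """Toy model: remove `pattern` from `chunk`, then decide what to do if nothing is left."""
    haystack = chunk if case_sensitive else chunk.lower()
    needle = pattern if case_sensitive else pattern.lower()
    idx = haystack.find(needle)
    if idx == -1:
        return chunk  # pattern not found: the chunk is untouched
    remaining = (chunk[:idx] + chunk[idx + len(pattern):]).strip()
    if remaining:
        return remaining
    # The whole chunk was excluded: drop it, or keep the original text.
    return None if drop_empty else chunk


print(exclude_pattern("41 years old", "years old", drop_empty=True))
print(exclude_pattern("September", "september", drop_empty=True))
print(exclude_pattern("September", "september", drop_empty=False))
```

The second call returns `None` (the chunk is dropped), while the third returns the original text because empty chunks are not dropped.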

**Rule Settings:**

- `entity`: The target entity label to modify.  
  Example: `"AGE"`.
- `prefixPatterns`: Array of patterns (words/phrases) to match **before the entity**.  
  Example: `["years", "old"]` matches entities preceded by "years" or "old."
- `suffixPatterns`: Array of patterns (words/phrases) to match **after the entity**.  
  Example: `["years", "old"]` matches entities followed by "years" or "old."
- `scopeWindowLevel`: Specifies the level of the scope window to consider.  
  Valid values: `"token"` or `"char"`. Default: `"token"`.
- `scopeWindow`: A tuple defining the range of tokens or characters (based on `scopeWindowLevel`) to include in the scope.  
  Default for "token" level: `(2, 2)`.
  Default for "char" level: `(10,10)`
  Example: `(2, 3)` means 2 tokens/characters before and 3 after the entity are considered.  
- `prefixRegexes`: Array of regular expressions to match **before the entity**.  
  Example: `["\\b(years|months)\\b"]` matches words like "years" or "months" as prefixes.
- `suffixRegexes`: Array of regular expressions to match **after the entity**.  
  Example: `["\\b(old|young)\\b"]` matches words like "old" or "young" as suffixes.
- `prefixEntities` : Entities to match before the entity.
- `suffixEntities` : Entities to match after the entity.
- `regexInBetween` (str): Regular expression to match text between the entity and prefix/suffix. If matched, the prefix/suffix entities will be included with the target entity.  
- `replaceEntity`: Optional string specifying the new entity label that replaces the target entity label.  
  Example: `"MODIFIED_AGE"` replaces `"AGE"` with `"MODIFIED_AGE"` in matching cases.
- `mode`: Specifies the operational mode for the rule.  
  Values used in this notebook: `"include"`, `"exclude"`, and `"replace_label_only"`.
  Default: `"include"`
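
The defaults listed above can be summarized in a small sketch. `with_defaults` and `DEFAULT_SCOPE_WINDOWS` are hypothetical names introduced for illustration; the annotator applies its own defaults internally, so this only shows what a rule looks like once the documented defaults are filled in.

```python
from typing import Any, Dict

# Defaults as documented above; the constant and helper names are assumptions.
DEFAULT_SCOPE_WINDOWS = {"token": (2, 2), "char": (10, 10)}


def with_defaults(rule: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `rule` with the documented defaults made explicit."""
    if "entity" not in rule:
        raise ValueError("a rule needs a target 'entity'")
    level = rule.get("scopeWindowLevel", "token")
    if level not in DEFAULT_SCOPE_WINDOWS:
        raise ValueError(f"scopeWindowLevel must be 'token' or 'char', got {level!r}")
    filled = dict(rule)
    filled["scopeWindowLevel"] = level
    filled["scopeWindow"] = tuple(rule.get("scopeWindow", DEFAULT_SCOPE_WINDOWS[level]))
    filled.setdefault("mode", "include")
    return filled


# A rule that omits the window settings and the mode.
date_rule = with_defaults({
    "entity": "Date",
    "suffixRegexes": ["\\d{4}"],
    "replaceEntity": "Modified_Date",
})
print(date_rule["scopeWindowLevel"], date_rule["scopeWindow"], date_rule["mode"])
```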


## Goal

Let's assume that we want to build tabular data from the clinical text below. The table should contain the patient's demographic information and clinical history.

In [6]:
text = """The patient is a 41 years old Vietnamese female with a nonproductive cough that started last week.
She has a history of the diabetes mellitus with complications in May, 2006 and went to an urgent care center.
Chest x-ray revealed right-sided pleural effusion."""

## Base Pipeline

In [7]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("ner_chunks")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]


## Without ContextualEntityRuler

In [8]:
pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        jsl_ner,
        jsl_ner_converter,
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

Let's test the NER pipeline on the sample text.

In [9]:

print(text)
data = spark.createDataFrame([[text]]).toDF("text")

result = model.transform(data)
result.show()

The patient is a 41 years old Vietnamese female with a nonproductive cough that started last week.
She has a history of the diabetes mellitus with complications in May, 2006 and went to an urgent care center.
Chest x-ray revealed right-sided pleural effusion.
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|             jsl_ner|          ner_chunks|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The patient is a ...|[{document, 0, 25...|[{document, 0, 97...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 17, 28, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

In [10]:
result.select(F.explode(F.arrays_zip(result.ner_chunks.result,
                                     result.ner_chunks.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(truncate=False)

+------------------+-------------------------+
|ner_chunk         |label                    |
+------------------+-------------------------+
|41 years old      |Age                      |
|Vietnamese        |Race_Ethnicity           |
|female            |Gender                   |
|nonproductive     |Modifier                 |
|cough             |Symptom                  |
|last week         |RelativeDate             |
|She               |Gender                   |
|diabetes mellitus |Diabetes                 |
|May               |Date                     |
|2006              |Date                     |
|urgent care center|Clinical_Dept            |
|Chest x-ray       |Test                     |
|right-sided       |Direction                |
|pleural effusion  |Disease_Syndrome_Disorder|
+------------------+-------------------------+



## With `ContextualEntityRuler`

Suppose we want:

- "Age" to contain only the number
- "Diabetes" to include its complication
- "May" and "2006" concatenated into a single date

We can use the `ContextualEntityRuler` annotator for this.

In [11]:
rules = [
    {
        "entity" : "Age",
        "scopeWindow" : [15,15],
        "scopeWindowLevel"  : "char",
        "suffixPatterns" : ["years old", "year old", "months",],
        "replaceEntity" : "Modified_Age",
        "mode" : "exclude"
    },
    {
        "entity" : "Diabetes",
        "scopeWindow" : [3,3],
        "scopeWindowLevel"  : "token",
        "suffixPatterns" : ["with complications"],
        "replaceEntity" : "Modified_Diabetes",
        "mode" : "include"

    },
    {
        "entity" : "Date",
        "suffixRegexes" : ["\\d{4}"],
        "replaceEntity" : "Modified_Date",
        "mode" : "include"
    }
]

In [12]:
contextual_entity_ruler = medical.ContextualEntityRuler() \
    .setInputCols("sentence", "token", "ner_chunks") \
    .setOutputCol("ruled_ner_chunks") \
    .setRules(rules) \
    .setCaseSensitive(False)\
    .setDropEmptyChunks(True)\
    .setAllowPunctuationInBetween(True)


ruler_pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        jsl_ner,
        jsl_ner_converter,
        contextual_entity_ruler
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
ruler_model = ruler_pipeline.fit(empty_data)

ruler_result = ruler_model.transform(data)
ruler_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|             jsl_ner|          ner_chunks|    ruled_ner_chunks|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The patient is a ...|[{document, 0, 25...|[{document, 0, 97...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 17, 28, ...|[{chunk, 17, 18, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [13]:
ruler_result.select("ner_chunks").show(truncate=False)


In [14]:
ruler_result.select("ruled_ner_chunks").show(truncate=False)


In [15]:
print("BEFORE CONTEXTUAL ENTITY RULER:")
ruler_result.select(F.explode(F.arrays_zip(ruler_result.ner_chunks.result,
                                           ruler_result.ner_chunks.metadata
                                           )).alias("cols"))\
            .select(F.expr("cols['0']").alias("ner_chunks"),
                    F.expr("cols['1']['entity']").alias("labels")
                    ).show(truncate=False)

print("*"*50, "\n")

print("AFTER CONTEXTUAL ENTITY RULER:")
ruler_result.select(F.explode(F.arrays_zip(ruler_result.ruled_ner_chunks.result,
                                           ruler_result.ruled_ner_chunks.metadata
                                           )).alias("cols"))\
            .select(F.expr("cols['0']").alias("ner_chunks"),
                    F.expr("cols['1']['entity']").alias("labels")
                    ).show(truncate=False)

BEFORE CONTEXTUAL ENTITY RULER:
+------------------+-------------------------+
|ner_chunks        |labels                   |
+------------------+-------------------------+
|41 years old      |Age                      |
|Vietnamese        |Race_Ethnicity           |
|female            |Gender                   |
|nonproductive     |Modifier                 |
|cough             |Symptom                  |
|last week         |RelativeDate             |
|She               |Gender                   |
|diabetes mellitus |Diabetes                 |
|May               |Date                     |
|2006              |Date                     |
|urgent care center|Clinical_Dept            |
|Chest x-ray       |Test                     |
|right-sided       |Direction                |
|pleural effusion  |Disease_Syndrome_Disorder|
+------------------+-------------------------+

************************************************** 

AFTER CONTEXTUAL ENTITY RULER:
+------------------------------------

As the results show:
- "years old" was removed from the `Age` entity.
- "diabetes mellitus" and "with complications" were merged.
- "May" and "2006" were merged into the single date "May, 2006". The "," between them did not block the merge because we set `setAllowPunctuationInBetween(True)`.
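
The date merge can be pictured with a small regex sketch. This is a toy model under stated assumptions, not the annotator's algorithm: `extend_with_suffix` is a hypothetical helper, it ignores the scope window, and it treats `end` as inclusive, matching the chunk offsets printed in the DataFrame output above.

```python
import re
import string
from typing import Tuple


def extend_with_suffix(text: str, begin: int, end: int, suffix_regex: str,
                       allow_punctuation: bool = True) -> Tuple[int, int]:
    """Extend the span [begin, end] (end inclusive) over a regex match right after it."""
    pos = end + 1
    # Whitespace is always skipped; punctuation only when allowed.
    skippable = " \t\n" + (string.punctuation if allow_punctuation else "")
    while pos < len(text) and text[pos] in skippable:
        pos += 1
    match = re.compile(suffix_regex).match(text, pos)
    if match is None:
        return begin, end  # no suffix found: keep the original span
    return begin, match.end() - 1


sentence = ("She has a history of the diabetes mellitus with complications "
            "in May, 2006 and went to an urgent care center.")
start = sentence.index("May")

b, e = extend_with_suffix(sentence, start, start + 2, r"\d{4}")
print(sentence[b:e + 1])

b, e = extend_with_suffix(sentence, start, start + 2, r"\d{4}", allow_punctuation=False)
print(sentence[b:e + 1])
```

With punctuation allowed the span grows to cover "May, 2006"; with it disallowed the comma blocks the match and the span stays "May".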

## LightPipeline

Now we will create a LightPipeline and visualize its results.

In [16]:
lmodel = nlp.LightPipeline(ruler_model)

light_result = lmodel.fullAnnotate(text)
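
Since the stated goal is tabular data, the chunk annotations can be flattened into rows. The sketch below assumes each annotation exposes `result`, `begin`, `end`, and a dict-like `metadata` with an `entity` key; `chunks_to_rows` is a hypothetical helper, and the demo uses a stand-in object rather than a real annotation.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def chunks_to_rows(annotations: Iterable[Any]) -> List[Dict[str, Any]]:
    """Flatten chunk annotations into one plain dict per chunk."""
    return [
        {
            "chunk": ann.result,
            "begin": ann.begin,
            "end": ann.end,
            "label": ann.metadata.get("entity"),
        }
        for ann in annotations
    ]


# Stand-in for one chunk annotation; values taken from the BEFORE output above.
demo = [SimpleNamespace(result="41 years old", begin=17, end=28,
                        metadata={"entity": "Age"})]
rows = chunks_to_rows(demo)
print(rows)
```

With the objects from the cell above, the call would look like `chunks_to_rows(light_result[0]["ruled_ner_chunks"])`, and `pd.DataFrame(rows)` would turn the rows into a table.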

In [17]:
"Before Contextual Entity Ruler".upper()

'BEFORE CONTEXTUAL ENTITY RULER'

In [18]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

print("### BEFORE CONTEXTUAL ENTITY RULER: ###\n")
visualiser.display(light_result[0], label_col='ner_chunks', document_col='document')

print("*"*100,"\n")

print("### AFTER CONTEXTUAL ENTITY RULER: ###\n")
visualiser.display(light_result[0], label_col='ruled_ner_chunks', document_col='document')

### BEFORE CONTEXTUAL ENTITY RULER: ###



**************************************************************************************************** 

### AFTER CONTEXTUAL ENTITY RULER: ###



## Entity Support and RegexInBetween

In [19]:
text = """Los Angeles, zip code 90001, is located in the South Los Angeles region of the city."""

**Explanation**

**Goal**: Merge "LOCATION" and "CONTACT" entities if they are connected by the word "zip".

**How**: This rule searches for the pattern "zip" between "LOCATION" and "CONTACT" entities within a 6-token window.


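**Effect**: When the rule matches, the entities are merged and labeled as "REPLACED_LOC". In the run recorded below, the AFTER table is identical to the BEFORE table, so the rule did not fire on this sentence.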

In [20]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunks")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


In [21]:
rules = [
    {
        "entity" : "LOCATION",
        "scopeWindow" : [6,6],
        "scopeWindowLevel"  : "token",
        "regexInBetween": "^zip$",
        "suffixEntities" : ["CONTACT","IDNUM"],
        "replaceEntity" : "REPLACED_LOC",
        "mode" : "include",
    }
]

In [22]:
contextual_entity_ruler = medical.ContextualEntityRuler() \
    .setInputCols("sentence", "token", "ner_chunks") \
    .setOutputCol("ruled_ner_chunks") \
    .setRules(rules) \
    .setCaseSensitive(False)\
    .setDropEmptyChunks(True)


ruler_pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextual_entity_ruler
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
ruler_model = ruler_pipeline.fit(empty_data)

data = spark.createDataFrame([[text]]).toDF("text")

ruler_result = ruler_model.transform(data)
ruler_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|          ner_chunks|    ruled_ner_chunks|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Los Angeles, zip ...|[{document, 0, 83...|[{document, 0, 83...|[{token, 0, 2, Lo...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 10, L...|[{chunk, 0, 10, L...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [23]:
print("BEFORE CONTEXTUAL ENTITY RULER:")
ruler_result.select(F.explode(F.arrays_zip(ruler_result.ner_chunks.result,
                                           ruler_result.ner_chunks.metadata
                                           )).alias("cols"))\
            .select(F.expr("cols['0']").alias("ner_chunks"),
                    F.expr("cols['1']['entity']").alias("labels")
                    ).show(truncate=False)

print("*"*50, "\n")

print("AFTER CONTEXTUAL ENTITY RULER:")
ruler_result.select(F.explode(F.arrays_zip(ruler_result.ruled_ner_chunks.result,
                                           ruler_result.ruled_ner_chunks.metadata
                                           )).alias("cols"))\
            .select(F.expr("cols['0']").alias("ner_chunks"),
                    F.expr("cols['1']['entity']").alias("labels")
                    ).show(truncate=False)

BEFORE CONTEXTUAL ENTITY RULER:
+-----------------+--------+
|ner_chunks       |labels  |
+-----------------+--------+
|Los Angeles      |LOCATION|
|90001            |CONTACT |
|South Los Angeles|LOCATION|
+-----------------+--------+

************************************************** 

AFTER CONTEXTUAL ENTITY RULER:
+-----------------+--------+
|ner_chunks       |labels  |
+-----------------+--------+
|Los Angeles      |LOCATION|
|90001            |CONTACT |
|South Los Angeles|LOCATION|
+-----------------+--------+



**Explanation**

**Goal**: Change the label of "CONTACT" to "ZIP_CODE" without altering the text.

**How**: Using a 6-token window, the rule checks if a "LOCATION" entity precedes the "CONTACT" entity.

**Effect**: Only the entity label is modified.
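
The check described above can be sketched in plain Python. This is a toy model, not the annotator's implementation: `Chunk`, its token indices, and `relabel_if_preceded` are hypothetical, and the token positions assume that punctuation marks count as separate tokens.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Chunk:
    text: str
    label: str
    first_token: int  # index of the chunk's first token in the sentence
    last_token: int   # index of the chunk's last token in the sentence


def relabel_if_preceded(chunks: Sequence[Chunk], target: str,
                        prefix_entities: Sequence[str], new_label: str,
                        window: int = 6) -> List[Chunk]:
    """Relabel `target` chunks that have a prefix entity ending within `window` tokens before them."""
    out = []
    for chunk in chunks:
        preceded = chunk.label == target and any(
            other.label in prefix_entities
            and 0 < chunk.first_token - other.last_token <= window
            for other in chunks
        )
        # Only the label changes; text and offsets stay as they are.
        out.append(replace(chunk, label=new_label) if preceded else chunk)
    return out


# Tokens: Los(0) Angeles(1) ,(2) zip(3) code(4) 90001(5) ,(6) ... South(11) Los(12) Angeles(13)
chunks = [
    Chunk("Los Angeles", "LOCATION", 0, 1),
    Chunk("90001", "CONTACT", 5, 5),
    Chunk("South Los Angeles", "LOCATION", 11, 13),
]
for c in relabel_if_preceded(chunks, "CONTACT", ["LOCATION"], "ZIP_CODE"):
    print(c.text, c.label)
```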





In [24]:
rules = [
    {
        "entity" : "CONTACT",
        "scopeWindow" : [6,6],
        "scopeWindowLevel"  : "token",
        "prefixEntities" : ["LOCATION"],
        "replaceEntity" : "ZIP_CODE",
        "mode" : "replace_label_only"
    }
]

In [25]:
contextual_entity_ruler = medical.ContextualEntityRuler() \
    .setInputCols("sentence", "token", "ner_chunks") \
    .setOutputCol("ruled_ner_chunks") \
    .setRules(rules) \
    .setCaseSensitive(False)\
    .setDropEmptyChunks(True)\
    .setAllowTokensInBetween(True)


ruler_pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextual_entity_ruler
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
ruler_model = ruler_pipeline.fit(empty_data)

data = spark.createDataFrame([[text]]).toDF("text")

ruler_result = ruler_model.transform(data)
ruler_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|          ner_chunks|    ruled_ner_chunks|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Los Angeles, zip ...|[{document, 0, 83...|[{document, 0, 83...|[{token, 0, 2, Lo...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 10, L...|[{chunk, 0, 10, L...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [26]:
print("BEFORE CONTEXTUAL ENTITY RULER:")
ruler_result.select(F.explode(F.arrays_zip(ruler_result.ner_chunks.result,
                                           ruler_result.ner_chunks.metadata
                                           )).alias("cols"))\
            .select(F.expr("cols['0']").alias("ner_chunks"),
                    F.expr("cols['1']['entity']").alias("labels")
                    ).show(truncate=False)

print("*"*50, "\n")

print("AFTER CONTEXTUAL ENTITY RULER:")
ruler_result.select(F.explode(F.arrays_zip(ruler_result.ruled_ner_chunks.result,
                                           ruler_result.ruled_ner_chunks.metadata
                                           )).alias("cols"))\
            .select(F.expr("cols['0']").alias("ner_chunks"),
                    F.expr("cols['1']['entity']").alias("labels")
                    ).show(truncate=False)

BEFORE CONTEXTUAL ENTITY RULER:
+-----------------+--------+
|ner_chunks       |labels  |
+-----------------+--------+
|Los Angeles      |LOCATION|
|90001            |CONTACT |
|South Los Angeles|LOCATION|
+-----------------+--------+

************************************************** 

AFTER CONTEXTUAL ENTITY RULER:
+-----------------+--------+
|ner_chunks       |labels  |
+-----------------+--------+
|Los Angeles      |LOCATION|
|90001            |ZIP_CODE|
|South Los Angeles|LOCATION|
+-----------------+--------+

