![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/ContextualEntityRuler.ipynb)

# **📜 ContextualEntityRuler**


The **`ContextualEntityRuler`** annotator updates chunks based on contextual rules. Rules are defined as dictionaries and can specify prefixes, suffixes, and a scope window around each chunk in which to look for them. When a rule matches, the annotator replaces the chunk's entity label or its content. This is particularly useful for refining entity recognition results to fit specific needs.

**📖 Learning Objectives:**

1. Understand how to use the annotator.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Reference Documentation: [ContextualEntityRuler](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#contextualentityruler)


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels
# and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions = False
nlp.install(refresh_install=True)

In [None]:
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_9596 (5).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.5.1, 💊Spark-Healthcare==5.5.2, running on ⚡ PySpark==3.4.0


In [None]:
spark

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`, `TOKEN`

- Output: `CHUNK`

## **🔎 Parameters**


- `setCaseSensitive`: Whether matching against the context is case sensitive. Default is `False`.

- `setAllowPunctuationInBetween`: Whether to allow punctuation between prefix/suffix patterns and the entity. Default is `True`.

- `setDropEmptyChunks`: If `True`, removes chunks with empty content after applying rules. Default is `False`.

- `setMergeOverlapping`: If `False`, returns both the modified entities and the original entities. Default is `True`.

- `setAllowTokensInBetween`: If `True`, allows tokens between prefix/suffix patterns and the entity, enabling extended matches. If `False`, any token between a pattern and the entity prevents a match. Default is `False`.

- `rules`: The updating rules. Each rule is a dictionary with the following keys:
  - `entity`: The target entity label to modify. Example: `"AGE"`.
  - `prefixPatterns`: Array of patterns (words/phrases) to match before the entity. Example: `["years", "old"]` matches entities preceded by "years" or "old."
  - `suffixPatterns`: Array of patterns (words/phrases) to match after the entity. Example: `["years", "old"]` matches entities followed by "years" or "old."
  - `scopeWindowLevel`: Specifies the level of the scope window to consider. Valid values: `"token"` or `"char"`. Default: `"token"`.
  - `scopeWindow`: A tuple defining the range of tokens or characters (based on `scopeWindowLevel`) to include in the scope. Default for `"token"` level: `(2, 2)`. Default for `"char"` level: `(10, 10)`. Example: `(2, 3)` means 2 tokens/characters before and 3 after the entity are considered.
  - `prefixRegexes`: Array of regular expressions to match before the entity. Example: `["\\b(years|months)\\b"]` matches words like "years" or "months" as prefixes.
  - `suffixRegexes`: Array of regular expressions to match after the entity. Example: `["\\b(old|young)\\b"]` matches words like "old" or "young" as suffixes.
  - `replaceEntity`: Optional string giving the new entity label that replaces the target entity label. Example: `"MODIFIED_AGE"` replaces `"AGE"` with `"MODIFIED_AGE"` in matching cases.
  - `mode`: How a matching rule is applied. Possible values include `"include"`, `"exclude"`, and `"replace_label_only"`. Default: `"include"`.
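
Because rules are plain dictionaries, a misspelled key is easy to miss. The sketch below is a small pre-flight check built only from the keys and values listed above. `check_rule` is a hypothetical helper, not part of the library; it assumes that `entity` is required and that the three modes named above are the only accepted ones.

```python
# Hypothetical helper (not part of Spark NLP): sanity-check a rule dictionary
# against the keys documented in this section before passing it to setRules.
ALLOWED_KEYS = {
    "entity", "prefixPatterns", "suffixPatterns", "prefixRegexes",
    "suffixRegexes", "scopeWindowLevel", "scopeWindow", "replaceEntity", "mode",
}
# Assumption: only the three modes named above are accepted.
ALLOWED_MODES = {"include", "exclude", "replace_label_only"}
ALLOWED_LEVELS = {"token", "char"}


def check_rule(rule):
    """Return a list of human-readable problems found in one rule dictionary."""
    problems = []
    unknown = sorted(set(rule) - ALLOWED_KEYS)
    if unknown:
        problems.append(f"unknown keys: {unknown}")
    if "entity" not in rule:  # assumption: every rule needs a target entity
        problems.append("missing required key 'entity'")
    if rule.get("mode", "include") not in ALLOWED_MODES:
        problems.append(f"unexpected mode: {rule['mode']!r}")
    if rule.get("scopeWindowLevel", "token") not in ALLOWED_LEVELS:
        problems.append(f"unexpected scopeWindowLevel: {rule['scopeWindowLevel']!r}")
    window = rule.get("scopeWindow")
    if window is not None and (
        len(window) != 2 or not all(isinstance(n, int) for n in window)
    ):
        problems.append("scopeWindow must be a pair of integers")
    return problems


good = {"entity": "AGE", "scopeWindow": [2, 3], "suffixPatterns": ["years old"]}
typo = {"entity": "AGE", "sufixPatterns": ["years old"], "mode": "includ"}

print(check_rule(good))  # no problems expected
print(check_rule(typo))  # reports the misspelled key and the unknown mode
```

Running such a check over every rule before calling `setRules` turns a rule that would silently fail to match into an explicit message.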



  

### Pipeline

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


In [None]:
rules = [
    {
        "entity" : "NAME",
        "scopeWindow" : [6,6],
        "scopeWindowLevel"  : "token",
        "prefixPatterns" : ["Doctor"],
        "replaceEntity" : "REPLACED_NAME",
        "mode" : "include"

    },
    {
        "entity" : "DATE",
        "scopeWindow" : [6,6],
        "scopeWindowLevel"  : "token",
        "suffixPatterns" : ["with"],
        "replaceEntity" : "REPLACED_DATE",
        "mode" : "include"

    },
    {
        "entity" : "AGE",
        "scopeWindow" : [60,60],
        "scopeWindowLevel"  : "char",
        "suffixPatterns" : ["patient"],
        "replaceEntity" : "REPLACED_AGE",
    }
]

contextual_entity_ruler = medical.ContextualEntityRuler() \
            .setInputCols("sentence", "token", "ner_chunk") \
            .setOutputCol("ruled_ner_chunk") \
            .setRules(rules) \
            .setCaseSensitive(False)\
            .setDropEmptyChunks(True)\
            .setAllowPunctuationInBetween(False)\
            .setAllowTokensInBetween(True)


pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextual_entity_ruler,
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)


text = """ The Doctor who is John Snow, assessed the 36 years old patient on November 25, 2024, who presented with symptoms of the immune system with tender areas.
       """
data = spark.createDataFrame([[text]]).toDF("text")

result = model.transform(data)


In [None]:
import pyspark.sql.functions as F

In [None]:
ner_chunk_df = result.select(F.explode(F.arrays_zip(
                          result.ner_chunk.result,
                          result.ner_chunk.begin,
                          result.ner_chunk.end,
                          result.ner_chunk.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"),
                          F.expr("cols['3']['confidence']").alias("confidence"))

ner_chunk_df.show(50, truncate=100)

+-----------------+-----+---+---------+----------+
|            chunk|begin|end|ner_label|confidence|
+-----------------+-----+---+---------+----------+
|        John Snow|   19| 27|     NAME|0.98730004|
|               36|   43| 44|      AGE|    0.9997|
|November 25, 2024|   67| 83|     DATE|0.98829997|
+-----------------+-----+---+---------+----------+



In [None]:
ner_chunk_df = result.select(F.explode(F.arrays_zip(
                          result.ruled_ner_chunk.result,
                          result.ruled_ner_chunk.begin,
                          result.ruled_ner_chunk.end,
                          result.ruled_ner_chunk.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"),
                          F.expr("cols['3']['confidence']").alias("confidence"))

ner_chunk_df.show(50, truncate=100)

+-----------------------+-----+---+-------------+----------+
|                  chunk|begin|end|    ner_label|confidence|
+-----------------------+-----+---+-------------+----------+
|Doctor who is John Snow|    5| 27|REPLACED_NAME|0.98730004|
|   36 years old patient|   43| 62| REPLACED_AGE|    0.9997|
|      November 25, 2024|   67| 83|         DATE|0.98829997|
+-----------------------+-----+---+-------------+----------+
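
One way to read the first row: in `include` mode the matched prefix, and everything between it and the entity, is folded into the chunk, which is why `begin` moves from 19 to 5. The toy function below reproduces that arithmetic on the sample sentence. It is a rough model only, assuming whitespace tokenization and a single-word prefix. `extend_with_prefix` is a hypothetical name, and the annotator's actual matching logic is not shown in this notebook.

```python
import re


def extend_with_prefix(text, begin, end, prefix, window_tokens):
    """Toy model: widen an inclusive (begin, end) span to cover a prefix word
    found within `window_tokens` whitespace tokens before the chunk."""
    # Whitespace tokens that precede the chunk, with their start offsets.
    before = [(m.group(), m.start()) for m in re.finditer(r"\S+", text[:begin])]
    # Keep only the tokens inside the window.
    window = before[max(0, len(before) - window_tokens):]
    # Scan from the token nearest the chunk outwards.
    for token, start in reversed(window):
        if token.lower() == prefix.lower():  # case-insensitive comparison
            return start, end
    return begin, end  # no match: span unchanged


text = " The Doctor who is John Snow, assessed the 36 years old patient on November 25, 2024"
begin, end = extend_with_prefix(text, 19, 27, "Doctor", window_tokens=6)
print(begin, end, repr(text[begin:end + 1]))  # 5 27 'Doctor who is John Snow'
```

With `window_tokens=2` the same call leaves the span at `(19, 27)`, because "Doctor" is three tokens away from the chunk.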



In [None]:
ner_jsl = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_jsl")

ner_jsl_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_jsl"]) \
    .setOutputCol("ner_jsl_chunk")

chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("ner_chunk", "ner_jsl_chunk")\
    .setOutputCol('ner_chunk_merged')\
    .setBlackList(["DATE"])

rules = [
    {
        "entity" : "Age",
        "scopeWindow" : [15,15],
        "scopeWindowLevel"  : "char",
        "suffixPatterns" : ["years old", "year old", "months"],
        "replaceEntity" : "Modified_Age",
        "mode" : "exclude"
    },
    {
        "entity" : "Diabetes",
        "scopeWindow" : [3,3],
        "scopeWindowLevel"  : "token",
        "suffixPatterns" : ["complications"],
        "replaceEntity" : "Modified_Diabetes",
        "mode" : "exclude"
    },
    {
        "entity" : "NAME",
        "scopeWindow" : [3,3],
        "scopeWindowLevel"  : "token",
        "prefixPatterns" : ["MD","M.D", "Dr"],
        "replaceEntity" : "Doctor_Name",
        "mode" : "include"
    }
]

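# Note: the ruler below reads `ner_chunk`, not `ner_chunk_merged`, so the
# `ner_jsl` chunks (`Age`, `Diabetes`) never reach it. Pass `ner_chunk_merged`
# as the chunk input to apply the rules to the merged chunks instead.
contextual_entity_ruler = medical.ContextualEntityRuler() \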
    .setInputCols("sentence", "token", "ner_chunk") \
    .setOutputCol("ruled_ner_chunk") \
    .setRules(rules) \
    .setCaseSensitive(False)\
    .setDropEmptyChunks(True)\
    .setAllowPunctuationInBetween(True)\
    .setAllowTokensInBetween(True)

pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        ner_jsl,
        ner_jsl_converter,
        chunk_merger,
        contextual_entity_ruler,
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

text = """ Dr. John Snow assessed the 36 years old who has a history of the diabetes mellitus with complications in May, 2006"""
data = spark.createDataFrame([[text]]).toDF("text")

result = model.transform(data)

ner_jsl download started this may take some time.
[OK!]


In [None]:
ner_chunk_df = result.select(F.explode(F.arrays_zip(
                          result.ner_chunk_merged.result,
                          result.ner_chunk_merged.begin,
                          result.ner_chunk_merged.end,
                          result.ner_chunk_merged.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"),
                          F.expr("cols['3']['confidence']").alias("confidence"))

ner_chunk_df.show(50, truncate=100)

ner_chunk_df = result.select(F.explode(F.arrays_zip(
                          result.ruled_ner_chunk.result,
                          result.ruled_ner_chunk.begin,
                          result.ruled_ner_chunk.end,
                          result.ruled_ner_chunk.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"),
                          F.expr("cols['3']['confidence']").alias("confidence"))

ner_chunk_df.show(50, truncate=100)

+-----------------+-----+---+---------+----------+
|            chunk|begin|end|ner_label|confidence|
+-----------------+-----+---+---------+----------+
|        John Snow|    5| 13|     NAME|   0.88095|
|     36 years old|   28| 39|      Age| 0.8915334|
|diabetes mellitus|   66| 82| Diabetes|    0.9742|
+-----------------+-----+---+---------+----------+

+-------------+-----+---+-----------+----------+
|        chunk|begin|end|  ner_label|confidence|
+-------------+-----+---+-----------+----------+
|Dr. John Snow|    1| 13|Doctor_Name|   0.88095|
|           36|   28| 29|        AGE|    0.9988|
|    May, 2006|  106|114|       DATE|0.91326666|
+-------------+-----+---+-----------+----------+



This time we will set the `mode` parameter to `"replace_label_only"`.
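
As a rough mental model, based only on the outputs shown in this notebook: `replace_label_only` swaps the label and leaves the span untouched, while `include` also widens the span to cover the matched context. The sketch below encodes just those two behaviors. `Chunk` and `apply_match` are hypothetical names, and `exclude` is left out because its effect is not described here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Chunk:
    begin: int  # inclusive character offset
    end: int    # inclusive character offset
    label: str


def apply_match(chunk, match_begin, match_end, new_label, mode):
    """Toy model of two rule modes; not the annotator's implementation."""
    if mode == "replace_label_only":
        # Keep the original span, change only the label.
        return replace(chunk, label=new_label)
    if mode == "include":
        # Widen the span so it also covers the matched prefix or suffix.
        return Chunk(min(chunk.begin, match_begin), max(chunk.end, match_end), new_label)
    raise ValueError(f"mode not modeled in this sketch: {mode!r}")


text = "Electronically Reviewed by : Navkiranjit Gill, D.O. on\n10/22/2024 13:28:"
doctor = Chunk(29, 44, "DOCTOR")             # covers "Navkiranjit Gill"
suffix_begin = text.index("D.O.")            # start offset of the matched suffix
suffix_end = suffix_begin + len("D.O.") - 1  # inclusive end offset

label_only = apply_match(doctor, suffix_begin, suffix_end, "REPLACED_DOCTOR", "replace_label_only")
included = apply_match(doctor, suffix_begin, suffix_end, "REPLACED_DOCTOR", "include")

print(text[label_only.begin:label_only.end + 1], label_only.label)
print(text[included.begin:included.end + 1], included.label)
```

The `include` result spans offsets 29 to 50, matching the `Navkiranjit Gill, D.O.` row produced by the `include` run at the end of this notebook.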

In [None]:
data = [
    (1, "Performed on Oct 22 , 2024 13:14"),
    (2, "PHYSICAL EXAMINATION : Performed on Oct 22 , 2024 12:49"),
    (3, "Oct 22 , 20241:24PM Faisal Akbar , M.D ."),
    (4, "Electronically Reviewed by : Navkiranjit Gill, D.O. on\n10/22/2024 13:28:"),
    (5, "Electronically Reviewed By : KHANH T NGUYEN Oct 22 , 2024 1:28PM Electronically Reviewed by : KHANH NGUYEN , M.D . on 10/22/2024 13:28:27"),
    (6, "Labs ordered by KHANH T . NGUYEN on Oct 22 , 2024 :"),
    (7, "NEW ORDERS TODAY : Return Visit ordered by KHANH T . NGUYEN on Oct 22 , 2024 :"),
    (8, "PERCEPTA testing by Pulm revealed high risk for malignancy and patient was therefore referred to CTS Dr Lazar"),
    (9, "Sign - Requested by SOHNEN MD , ADAM E ( on )"),
    (10, "Patient Name PEET , RODNEY W Procedure # 24509 Attending Doctor CATER , GEORGE MRN 980615798 Appointment 10/Oct/2024 10:30AM Ordering Provider CATER , GEORGE"),
    (11, "Electronically signed by Dr . David Liddle on 18-Oct-2024 at 08:39"),
    (12, "Ordering Provider : Lee M ."),
    (13, "Patient Name : VANCHERI , THERESA L Date of Study : 10-22-2024")
]


columns = ["id", "text"]
text_df = spark.createDataFrame(data, columns)

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner = medical.NerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
[OK!]
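
`setRulesAsStr` takes the rules as a JSON string. Rather than typing that string by hand, it can be generated from the same list-of-dictionaries form used with `setRules`. The sketch below does this with the standard-library `json` module, using one of the rules from the cell that follows. That the annotator accepts the pretty-printed output of `json.dumps` is an assumption, although the result is ordinary JSON.

```python
import json

# Same structure as the list passed to setRules earlier in this notebook.
rules = [
    {
        "entity": "DATE",
        "scopeWindow": [5, 5],
        "scopeWindowLevel": "token",
        "prefixPatterns": ["Performed On", "Created On :"],
        "replaceEntity": "REPLACED_DATE",
        "mode": "replace_label_only",
    },
]

# Serialize to JSON text that can be passed around as a plain string.
json_rules = json.dumps(rules, indent=2)
print(json_rules)

# Round-trip check: the string parses back to the same rules.
assert json.loads(json_rules) == rules
```

Generating the string this way also avoids JSON syntax slips such as trailing commas, which Python dictionaries tolerate but JSON does not.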


In [None]:
json_rules = """
[
  {
    "entity": "DATE",
    "scopeWindow": [5, 5],
    "scopeWindowLevel": "token",
    "prefixPatterns": ["Performed On", "Created On :"],
    "replaceEntity": "REPLACED_DATE",
    "mode": "replace_label_only"
  },
  {
    "entity": "DOCTOR",
    "scopeWindow": [5, 5],
    "scopeWindowLevel": "token",
    "suffixRegexes": ["D.O.", "M.A"],
    "replaceEntity": "REPLACED_DOCTOR",
    "mode": "replace_label_only"
  }
]
"""

contextual_entity_ruler = medical.ContextualEntityRuler() \
    .setInputCols(["document", "token", "ner_chunk"]) \
    .setOutputCol("updated_chunk") \
    .setRulesAsStr(json_rules) \
    .setCaseSensitive(False) \
    .setAllowPunctuationInBetween(True) \
    .setDropEmptyChunks(False) \
    .setAllowTokensInBetween(True)


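# Note: this stage list reuses `documentAssembler`, `sentenceDetector`, and
# `clinical_ner` (ner_deid_generic_augmented) from the earlier cells, so people
# are labeled NAME in the output below and the DOCTOR rule has nothing to match.
# The rerun further down uses `document_assembler` and `ner` instead.
pipeline = nlp.Pipeline().setStages([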
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    contextual_entity_ruler
])

model = pipeline.fit(text_df)
result_df = model.transform(text_df).cache()
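
A caveat on `suffixRegexes`: an unescaped `.` in a regular expression matches any character, so `"D.O."` matches more than the literal abbreviation. The snippet below shows the difference using Python's `re` module purely for illustration (the annotator evaluates its patterns on the JVM, where `.` has the same meaning). It also shows how `json.dumps` produces the backslash escaping that a JSON rules string would need. Whether a stricter pattern is desirable depends on the data.

```python
import json
import re

loose = "D.O."      # as written in the rules above: each dot matches any character
strict = r"D\.O\."  # dots escaped: matches only the literal "D.O."

print(bool(re.search(loose, "DxOy")))            # True
print(bool(re.search(strict, "DxOy")))           # False
print(bool(re.search(strict, "Gill, D.O. on")))  # True

# json.dumps doubles the backslashes, which is what JSON text requires.
print(json.dumps({"suffixRegexes": [strict]}))   # {"suffixRegexes": ["D\\.O\\."]}
```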


In [None]:
result_df.select("id", "ner_chunk").show(truncate=False)
result_df.select("id", "updated_chunk").show(truncate=False)


In [None]:
flattener = (
    medical.Flattener()
    .setInputCols("ner_chunk", "updated_chunk")
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunk",
                                             "begin as begin",
                                             "end as end",
                                             "metadata.entity as entity"],
                               "updated_chunk": ["result as updated_chunk",
                                             "begin as updated_begin",
                                             "end as updated_end",
                                             "metadata.entity as updated_entity"]
                               })
    .setKeepOriginalColumns(["id"])
)

flattener.transform(result_df).show(truncate=False)

+----------------+-----+---+-------+----------------+-------------+-----------+--------------+---+
|ner_chunk       |begin|end|entity |updated_chunk   |updated_begin|updated_end|updated_entity|id |
+----------------+-----+---+-------+----------------+-------------+-----------+--------------+---+
|Oct 22 , 2024   |13   |25 |DATE   |Oct 22 , 2024   |13           |25         |REPLACED_DATE |1  |
|Oct 22 , 2024   |36   |48 |DATE   |Oct 22 , 2024   |36           |48         |REPLACED_DATE |2  |
|Oct 22          |0    |5  |DATE   |Oct 22          |0            |5          |DATE          |3  |
|Faisal Akbar    |20   |31 |NAME   |Faisal Akbar    |20           |31         |NAME          |3  |
|Navkiranjit Gill|29   |44 |NAME   |Navkiranjit Gill|29           |44         |NAME          |4  |
|10/22/2024      |55   |64 |DATE   |10/22/2024      |55           |64         |DATE          |4  |
|KHANH T NGUYEN  |29   |42 |NAME   |KHANH T NGUYEN  |29           |42         |NAME          |5  |
|Oct 22 , 

In [None]:
json_rules = """
[
  {
    "entity": "DATE",
    "scopeWindow": [5, 5],
    "scopeWindowLevel": "token",
    "prefixPatterns": ["Performed On", "Created On :"],
    "replaceEntity": "REPLACED_DATE",
    "mode": "include"
  },
  {
    "entity": "DOCTOR",
    "scopeWindow": [5, 5],
    "scopeWindowLevel": "token",
    "suffixRegexes": ["D.O.", "M.A"],
    "replaceEntity": "REPLACED_DOCTOR",
    "mode": "include"
  }
]
"""

contextual_entity_ruler = medical.ContextualEntityRuler() \
    .setInputCols(["document", "token", "ner_chunk"]) \
    .setOutputCol("updated_chunk") \
    .setRulesAsStr(json_rules) \
    .setCaseSensitive(False) \
    .setAllowPunctuationInBetween(True) \
    .setDropEmptyChunks(False) \
    .setAllowTokensInBetween(True)

# Define the pipeline
pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    word_embeddings,
    ner,
    ner_converter,
    contextual_entity_ruler
])

# Fit and apply the pipeline
model = pipeline.fit(text_df)
result_df = model.transform(text_df).cache()

In [None]:
result_df.select("id", "ner_chunk").show(truncate=False)
result_df.select("id", "updated_chunk").show(truncate=False)


In [None]:
flattener = (
    medical.Flattener()
    .setInputCols("ner_chunk", "updated_chunk")
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunk",
                                             "begin as begin",
                                             "end as end",
                                             "metadata.entity as entity"],
                               "updated_chunk": ["result as updated_chunk",
                                             "begin as updated_begin",
                                             "end as updated_end",
                                             "metadata.entity as updated_entity"]
                               })
    .setKeepOriginalColumns(["id"])
)

flattener.transform(result_df).show(truncate=False)

+----------------+-----+---+-------------+----------------------+-------------+-----------+---------------+---+
|ner_chunk       |begin|end|entity       |updated_chunk         |updated_begin|updated_end|updated_entity |id |
+----------------+-----+---+-------------+----------------------+-------------+-----------+---------------+---+
|Oct 22 , 2024   |13   |25 |DATE         |Oct 22 , 2024         |13           |25         |DATE           |1  |
|Oct 22 , 2024   |36   |48 |DATE         |Oct 22 , 2024         |36           |48         |DATE           |2  |
|Oct 22          |0    |5  |DATE         |Oct 22                |0            |5          |DATE           |3  |
|Faisal Akbar    |20   |31 |DOCTOR       |Faisal Akbar          |20           |31         |DOCTOR         |3  |
|Navkiranjit Gill|29   |44 |DOCTOR       |Navkiranjit Gill, D.O.|29           |50         |REPLACED_DOCTOR|4  |
|10/22/2024      |55   |64 |DATE         |10/22/2024            |55           |64         |DATE         