![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/ContextualEntityFilterer.ipynb)

#   **📜 ContextualEntityFilterer**


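The **`ContextualEntityFilterer`** annotator filters NER chunks based on the context in which they appear. For a target entity, rules decide whether a chunk is kept or dropped depending on nearby entities, nearby words, regex matches, and the chunk's confidence, so that spurious entities do not interfere with downstream results.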

**📖 Learning Objectives:**

1. Understand how to use the annotator.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Reference Documentation: [ContextualEntityFilterer](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators)


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [4]:
# Automatically load the license data and start a session with all jars the user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_9596 (5).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.5.1, 💊Spark-Healthcare==5.5.2, running on ⚡ PySpark==3.4.0


In [5]:
spark

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`, `TOKEN`

- Output: `CHUNK`

## **🔎 Parameters**


**Parameters**:

- `ruleScope` (str): The scope in which the rules are applied. Options: `sentence`, `document`.
- `rules` (list[dict]): A list of rule dictionaries. Each rule can contain the following keys:
  - `entity`: The target entity label to which the rule applies.
  - `scopeWindow`: A list of two integers `[before, after]` specifying how many tokens/chunks before and after the target to consider.
  - `scopeWindowLevel`: Determines whether the `scopeWindow` is applied at the token or chunk level. Options: `token`, `chunk`.
  - `whiteListEntities`: The white list of entities. If any entity from this list appears within the scope window, the chunk is kept. A single match is enough to keep the chunk.
  - `blackListEntities`: The black list of entities. If any entity from this list appears within the scope window, the chunk is filtered out. All listed entities must be absent for the chunk to be kept.
  - `whiteListWords`: The white list of words. If a word from this list appears within the scope window, the chunk is kept.
  - `blackListWords`: The black list of words. If a word from this list appears within the scope window, the chunk is filtered out.
  - `confidenceThreshold`: The confidence threshold for filtering. Filtering is only applied if the confidence of the chunk is below the threshold.
  - `possibleRegexContext`: A regex describing a valid context. If the regex is found in the context (chunk), the chunk is kept.
  - `impossibleRegexContext`: A regex describing an invalid context. If the regex is found in the context (chunk), the chunk is removed.

When defining regex patterns in code, use double escape characters (e.g., `\\`) to ensure proper handling of special characters.
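As a rough sketch of how these keys fit together, the rule below is written as a plain Python dictionary. Only the key names come from the parameter list; the entity labels, the threshold, and the pattern are hypothetical values chosen for illustration, and the value types (list of strings, float) are assumptions. The last lines show that a double-escaped string and a Python raw string spell the same pattern.

```python
# Illustrative rule: the key names follow the parameter list above,
# every value is a hypothetical example rather than a recommendation.
example_rule = {
    "entity": "STATE",
    "scopeWindow": [2, 2],              # two tokens before, two after
    "scopeWindowLevel": "token",
    "whiteListEntities": ["CITY"],      # assumed to be a list of labels
    "blackListEntities": ["NAME"],
    "confidenceThreshold": 0.95,        # assumed to be a float
    "impossibleRegexContext": "\\b\\d{4,}\\b",
}
rules = [example_rule]

# A double-escaped string and a raw string denote the same regex.
same_pattern = example_rule["impossibleRegexContext"] == r"\b\d{4,}\b"
print(same_pattern)
```

A list like `rules` is what the examples in this notebook pass to `setRules`.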


  

### Pipeline

In [6]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
      .setInputCols(["document", "token"])\
      .setOutputCol("embeddings")

ner_deid = medical.NerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models")  \
      .setInputCols(["document", "token", "embeddings"]) \
      .setOutputCol("ner_deid_subentity_docwise")

ner_deid_converter = medical.NerConverterInternal()\
      .setInputCols(["document", "token", "ner_deid_subentity_docwise"])\
      .setOutputCol("ner_chunk_subentity_docwise")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_docwise download started this may take some time.
[OK!]


### `setRules`

In [7]:
rules =[{       "entity": "STATE",
                "scopeWindow": [2, 2],
                "whiteList": ["CITY"],
                "blackList": ["NAME"],
                "scopeWindowLevel": "token"
            }]

In [8]:
contextual_entity_filterer = medical.ContextualEntityFilterer() \
    .setInputCols("document", "token", "ner_chunk_subentity_docwise") \
    .setOutputCol("filtered_ner_chunks") \
    .setRules(rules)\
    .setRuleScope("sentence")  # alternatively: "document"

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      tokenizer,
      word_embeddings,
      ner_deid,
      ner_deid_converter,
      contextual_entity_filterer
      ])

In [9]:
text = "NY, a 34-year-old woman, Dr. Michael Johnson cares wit her, at CarePlus Clinic, located at 456 Elm Street, NewYork, NY has recommended starting insulin therapy."
df = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(df).transform(df).cache()

In [10]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------------+---------------------------+--------------------+
|                text|            document|               token|          embeddings|ner_deid_subentity_docwise|ner_chunk_subentity_docwise| filtered_ner_chunks|
+--------------------+--------------------+--------------------+--------------------+--------------------------+---------------------------+--------------------+
|NY, a 34-year-old...|[{document, 0, 15...|[{token, 0, 1, NY...|[{word_embeddings...|      [{named_entity, 0...|       [{chunk, 0, 1, NY...|[{chunk, 0, 1, NY...|
+--------------------+--------------------+--------------------+--------------------+--------------------------+---------------------------+--------------------+



In [11]:
result.selectExpr("explode(ner_chunk_subentity_docwise) as ner_chunk").show(50,truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk                                                                                                                                              |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 1, NY, {entity -> STATE, confidence -> 0.9299, ner_source -> ner_chunk_subentity_docwise, chunk -> 0, sentence -> 0}, []}                   |
|{chunk, 6, 16, 34-year-old, {entity -> AGE, confidence -> 0.7687, ner_source -> ner_chunk_subentity_docwise, chunk -> 1, sentence -> 0}, []}           |
|{chunk, 29, 43, Michael Johnson, {entity -> DOCTOR, confidence -> 0.89965, ner_source -> ner_chunk_subentity_docwise, chunk -> 2, sentence -> 0}, []}  |
|{chunk, 63, 77, CarePlus Clinic, {entity -> HOSPITAL, confidence -> 0.9661,

In [12]:
result.selectExpr("explode(filtered_ner_chunks) as filtered_chunks").show(50,truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered_chunks                                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 1, NY, {chunk -> 0, confidence -> 0.9299, ner_source -> ner_chunk_subentity_docwise, entity -> STATE, sentence -> 0}, []}                   |
|{chunk, 6, 16, 34-year-old, {chunk -> 1, confidence -> 0.7687, ner_source -> ner_chunk_subentity_docwise, entity -> AGE, sentence -> 0}, []}           |
|{chunk, 29, 43, Michael Johnson, {chunk -> 2, confidence -> 0.89965, ner_source -> ner_chunk_subentity_docwise, entity -> DOCTOR, sentence -> 0}, []}  |
|{chunk, 63, 77, CarePlus Clinic, {chunk -> 3, confidence -> 0.9661, ner_sou

In [13]:
flattener = medical.Flattener()\
    .setInputCols("ner_chunk_subentity_docwise") \
    .setExplodeSelectedFields({"ner_chunk_subentity_docwise": ["result as chunk",
                                                                "begin as begin",
                                                                "end as end",
                                                                "metadata.entity as ner_label",
                                                                "metadata.confidence as confidence"]})

In [14]:
flattener.transform(result).show(truncate=False)

+---------------+-----+---+---------+----------+
|chunk          |begin|end|ner_label|confidence|
+---------------+-----+---+---------+----------+
|NY             |0    |1  |STATE    |0.9299    |
|34-year-old    |6    |16 |AGE      |0.7687    |
|Michael Johnson|29   |43 |DOCTOR   |0.89965   |
|CarePlus Clinic|63   |77 |HOSPITAL |0.9661    |
|456 Elm Street |91   |104|STREET   |0.7733667 |
|NewYork        |107  |113|CITY     |0.9302    |
|NY             |116  |117|STATE    |0.9991    |
+---------------+-----+---+---------+----------+



In [15]:
flattener = medical.Flattener()\
    .setInputCols("filtered_ner_chunks") \
    .setExplodeSelectedFields({"filtered_ner_chunks": ["result as chunk",
                                                       "begin as begin",
                                                       "end as end",
                                                       "metadata.entity as ner_label",
                                                       "metadata.confidence as confidence"]})
flattener.transform(result).show(truncate=False)

+---------------+-----+---+---------+----------+
|chunk          |begin|end|ner_label|confidence|
+---------------+-----+---+---------+----------+
|NY             |0    |1  |STATE    |0.9299    |
|34-year-old    |6    |16 |AGE      |0.7687    |
|Michael Johnson|29   |43 |DOCTOR   |0.89965   |
|CarePlus Clinic|63   |77 |HOSPITAL |0.9661    |
|456 Elm Street |91   |104|STREET   |0.7733667 |
|NewYork        |107  |113|CITY     |0.9302    |
|NY             |116  |117|STATE    |0.9991    |
+---------------+-----+---+---------+----------+



### `possibleRegexContext` and `impossibleRegexContext` Parameters

In [45]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunks")

rules =[{
    "entity": "AGE",
    "scopeWindow": [3, 3],
    "scopeWindowLevel": "token",
    "impossibleRegexContext" : "\\b(1[2-9]\\d|[2-9]\\d{2,}|\\d{4,})\\b"
}]

contextual_entity_filterer = medical.ContextualEntityFilterer() \
    .setInputCols("sentence", "token", "ner_chunks") \
    .setOutputCol("filtered_ner_chunks") \
    .setRules(rules)\
    .setRuleScope("sentence")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    contextual_entity_filterer,
])
empty_df = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_df)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


In [46]:
text = "California, known for its beautiful beaches,and he is 366 years old. " \
        "The Grand Canyon in Arizona,  where the age is 37, is a stunning natural landmark." \
        "It was founded on September 9, 1850, and Arizona on February 14, 1912."
df = spark.createDataFrame([[text]]).toDF("text")

# Transform only after `df` holds the new text, so `result` reflects this example
result = model.transform(df)
df.show()

+--------------------+
|                text|
+--------------------+
|California, known...|
+--------------------+



In [47]:
import pyspark.sql.functions as F

In [48]:
ner_chunk_df = result.select(F.explode(F.arrays_zip(
                          result.ner_chunks.result,
                          result.ner_chunks.begin,
                          result.ner_chunks.end,
                          result.ner_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"),
                          F.expr("cols['3']['confidence']").alias("confidence"))

ner_chunk_df.show(50, truncate=100)

+-----------------+-----+---+---------+----------+
|            chunk|begin|end|ner_label|confidence|
+-----------------+-----+---+---------+----------+
|       California|    0|  9| LOCATION|    0.9895|
|              366|   54| 56|      AGE|    0.9998|
|     Grand Canyon|   73| 84| LOCATION|    0.7097|
|          Arizona|   89| 95| LOCATION|    0.9987|
|               37|  116|117|      AGE|     0.992|
|September 9, 1850|  169|185|     DATE|  0.977525|
|February 14, 1912|  203|219|     DATE|0.95484996|
+-----------------+-----+---+---------+----------+



In [49]:
ner_chunk_df = result.select(F.explode(F.arrays_zip(
                          result.filtered_ner_chunks.result,
                          result.filtered_ner_chunks.begin,
                          result.filtered_ner_chunks.end,
                          result.filtered_ner_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"),
                          F.expr("cols['3']['confidence']").alias("confidence"))

ner_chunk_df.show(50, truncate=100)

+-----------------+-----+---+---------+----------+
|            chunk|begin|end|ner_label|confidence|
+-----------------+-----+---+---------+----------+
|       California|    0|  9| LOCATION|    0.9895|
|     Grand Canyon|   73| 84| LOCATION|    0.7097|
|          Arizona|   89| 95| LOCATION|    0.9987|
|               37|  116|117|      AGE|     0.992|
|September 9, 1850|  169|185|     DATE|  0.977525|
|February 14, 1912|  203|219|     DATE|0.95484996|
+-----------------+-----+---+---------+----------+



As seen above, the `impossibleRegexContext` pattern matches the `AGE` chunk `366`, so that chunk is removed. The chunk `37` does not match the pattern and is kept.
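The pattern used in the rule can be checked in isolation with Python's `re` module. This is only a sketch: it tests the pattern against bare chunk text and says nothing about how the annotator builds its context, and the helper name `matches_pattern` and the extra test values (`119`, `120`, `1850`) are introduced here for illustration. The pattern accepts 120 to 199, any number of three or more digits starting with 2 to 9, and any number with four or more digits.

```python
import re

# Same pattern as in the rule, written as a raw string instead of double-escaped.
AGE_PATTERN = re.compile(r"\b(1[2-9]\d|[2-9]\d{2,}|\d{4,})\b")


def matches_pattern(chunk_text: str) -> bool:
    """Return True if the pattern is found anywhere in the given text."""
    return AGE_PATTERN.search(chunk_text) is not None


# "366" and "37" are the AGE chunks from the example text;
# the remaining values are hypothetical boundary cases.
for candidate in ["366", "37", "119", "120", "1850"]:
    print(candidate, matches_pattern(candidate))
```

Used as `impossibleRegexContext`, a match marks the chunk as implausible. Used as `possibleRegexContext`, the same match is what allows the chunk to stay.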

In [53]:
rules =[{
    "entity": "AGE",
    "possibleRegexContext" : "\\b(1[2-9]\\d|[2-9]\\d{2,}|\\d{4,})\\b"
}]

contextual_entity_filterer = medical.ContextualEntityFilterer() \
    .setInputCols("sentence", "token", "ner_chunks") \
    .setOutputCol("filtered_ner_chunks") \
    .setRules(rules)\
    .setRuleScope("sentence")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    contextual_entity_filterer,
])

empty_df = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_df)
result = model.transform(df)

In [54]:
ner_chunk_df = result.select(F.explode(F.arrays_zip(
                          result.ner_chunks.result,
                          result.ner_chunks.begin,
                          result.ner_chunks.end,
                          result.ner_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"),
                          F.expr("cols['3']['confidence']").alias("confidence"))

ner_chunk_df.show(50, truncate=100)

+-----------------+-----+---+---------+----------+
|            chunk|begin|end|ner_label|confidence|
+-----------------+-----+---+---------+----------+
|       California|    0|  9| LOCATION|    0.9895|
|              366|   54| 56|      AGE|    0.9998|
|     Grand Canyon|   73| 84| LOCATION|    0.7097|
|          Arizona|   89| 95| LOCATION|    0.9987|
|               37|  116|117|      AGE|     0.992|
|September 9, 1850|  169|185|     DATE|  0.977525|
|February 14, 1912|  203|219|     DATE|0.95484996|
+-----------------+-----+---+---------+----------+



In [55]:
ner_chunk_df = result.select(F.explode(F.arrays_zip(
                          result.filtered_ner_chunks.result,
                          result.filtered_ner_chunks.begin,
                          result.filtered_ner_chunks.end,
                          result.filtered_ner_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"),
                          F.expr("cols['3']['confidence']").alias("confidence"))

ner_chunk_df.show(50, truncate=100)

+-----------------+-----+---+---------+----------+
|            chunk|begin|end|ner_label|confidence|
+-----------------+-----+---+---------+----------+
|       California|    0|  9| LOCATION|    0.9895|
|              366|   54| 56|      AGE|    0.9998|
|     Grand Canyon|   73| 84| LOCATION|    0.7097|
|          Arizona|   89| 95| LOCATION|    0.9987|
|September 9, 1850|  169|185|     DATE|  0.977525|
|February 14, 1912|  203|219|     DATE|0.95484996|
+-----------------+-----+---+---------+----------+
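With `possibleRegexContext` the outcome is reversed: `366` matches the pattern and is kept, while `37` does not match and is removed. To see at a glance which chunks a rule dropped, the two tables above can be compared. The sketch below does this in plain Python on `(chunk, begin, end)` triples copied from those tables; the helper name `removed_chunks` is hypothetical and not part of the library.

```python
# (chunk, begin, end) triples copied from the unfiltered table above.
ner_chunks = [
    ("California", 0, 9),
    ("366", 54, 56),
    ("Grand Canyon", 73, 84),
    ("Arizona", 89, 95),
    ("37", 116, 117),
    ("September 9, 1850", 169, 185),
    ("February 14, 1912", 203, 219),
]

# Triples copied from the filtered table above.
filtered_chunks = [
    ("California", 0, 9),
    ("366", 54, 56),
    ("Grand Canyon", 73, 84),
    ("Arizona", 89, 95),
    ("September 9, 1850", 169, 185),
    ("February 14, 1912", 203, 219),
]


def removed_chunks(before, after):
    """Return the chunks present in `before` but missing from `after`, in order."""
    kept = set(after)
    return [chunk for chunk in before if chunk not in kept]


print(removed_chunks(ner_chunks, filtered_chunks))
```

Here the only chunk reported as removed is `37` at offsets 116 to 117.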

