![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/ContextualEntityFilterer.ipynb)

#   **📜 ContextualEntityFilterer**


The **`ContextualEntityFilterer`** annotator filters NER chunks according to the context in which they appear. Each rule names a target entity and the entities or words that must, or must not, occur near it, so that likely false positives are removed while accurate results are preserved.

**📖 Learning Objectives:**

1. Understand how to use the annotator.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Reference Documentation: [ContextualEntityFilterer](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators)


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [None]:
# Automatically load the license data and start a session with all JARs the user has access to
spark = nlp.start()

In [None]:
spark

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`, `TOKEN`

- Output: `CHUNK`

## **🔎 Parameters**


- `ruleScope` (str): The scope in which the rules are applied. Options: `sentence`, `document`.
- `rules` (list[dict]): The filtering rules. Each rule is a dictionary with the following keys:
    - `entity`: The target entity label that the rule filters.
    - `scopeWindow`: A list of two integers `[before, after]` specifying how many tokens or chunks before and after the target to consider.
    - `whiteListEntities`: The white list of entities. If any entity from this list appears within the scope window, the chunk is kept. A single match is enough to keep the chunk.
    - `blackListEntities`: The black list of entities. If any entity from this list appears within the scope window, the chunk is filtered out. All listed entities must be absent for the chunk to be kept.
    - `scopeWindowLevel`: Determines whether `scopeWindow` is counted in tokens or in chunks. Options: `token`, `chunk`.
    - `blackListWords`: The black list of words. If a word from this list appears within the scope window, the chunk is filtered out.
    - `whiteListWords`: The white list of words. If a word from this list appears within the scope window, the chunk is kept.
    - `confidenceThreshold`: The confidence threshold for filtering. Filtering is applied only if the confidence of the chunk is below the threshold.
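The interplay of `scopeWindow`, `whiteListEntities`, and `blackListEntities` can be sketched in plain Python. This is a simplified model for intuition only, not the annotator's implementation: the function name `keep_chunk`, the restriction to single-token targets, the hand-labelled token lists, and the order in which the two lists are checked are all assumptions.

```python
def keep_chunk(labels, idx, rule):
    """Decide whether the single-token chunk at token index `idx` is kept.

    labels: one entity label per token, None for tokens outside any chunk.
    """
    before, after = rule["scopeWindow"]
    lo = max(0, idx - before)
    hi = min(len(labels), idx + after + 1)
    # Entity labels inside the window, excluding the target token itself.
    window = [labels[i] for i in range(lo, hi)
              if i != idx and labels[i] is not None]

    # Assumption: a blacklisted neighbour always removes the chunk.
    if any(label in rule.get("blackListEntities", []) for label in window):
        return False
    # Assumption: when a whitelist is given, one member must be in the window.
    white = rule.get("whiteListEntities", [])
    if white:
        return any(label in white for label in window)
    return True


rule = {"entity": "STATE", "scopeWindow": [2, 2],
        "whiteListEntities": ["CITY"], "blackListEntities": ["NAME"]}

# Hypothetical hand-labelled tokens: "Boston , MA" and "MA , a woman"
near_city = ["CITY", None, "STATE"]
no_city = ["STATE", None, None, None]

print(keep_chunk(near_city, 2, rule))  # a CITY lies two tokens before the STATE
print(keep_chunk(no_city, 0, rule))    # no CITY within two tokens
```

With `scopeWindowLevel` set to `chunk`, the same idea would apply to a list of chunks instead of a list of tokens.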
  

### Pipeline

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
      .setInputCols(["document", "token"])\
      .setOutputCol("embeddings")

ner_deid = medical.NerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models")  \
      .setInputCols(["document", "token", "embeddings"]) \
      .setOutputCol("ner_deid_subentity_docwise")

ner_deid_converter = medical.NerConverterInternal()\
      .setInputCols(["document", "token", "ner_deid_subentity_docwise"])\
      .setOutputCol("ner_chunk_subentity_docwise")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_docwise download started this may take some time.
[OK!]


### `setRules`

In [None]:
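# Key names follow the parameter list above (whiteListEntities / blackListEntities)
rules = [{
    "entity": "STATE",
    "scopeWindow": [2, 2],
    "whiteListEntities": ["CITY"],
    "blackListEntities": ["NAME"],
    "scopeWindowLevel": "token"
}]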
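Because rules are plain dictionaries, a mistyped key is easy to miss. The helper below is a hypothetical pre-flight check, not part of the library: it compares rule keys against the names listed in the Parameters section. The name `unknown_rule_keys` and the sample rules are illustrative assumptions, and the check says nothing about which spellings the annotator itself accepts.

```python
# Key names as listed in the Parameters section of this notebook.
DOCUMENTED_RULE_KEYS = {
    "entity", "scopeWindow", "scopeWindowLevel",
    "whiteListEntities", "blackListEntities",
    "whiteListWords", "blackListWords",
    "confidenceThreshold",
}


def unknown_rule_keys(rules):
    """Map each rule index to the sorted keys that are not documented above."""
    problems = {}
    for i, rule in enumerate(rules):
        extra = sorted(set(rule) - DOCUMENTED_RULE_KEYS)
        if extra:
            problems[i] = extra
    return problems


# Hypothetical rules: the second one contains a lowercase-"l" typo.
good_rule = {"entity": "STATE", "scopeWindow": [2, 2],
             "whiteListEntities": ["CITY"], "scopeWindowLevel": "token"}
typo_rule = {"entity": "STATE", "scopeWindow": [2, 2],
             "whitelistEntities": ["CITY"]}

print(unknown_rule_keys([good_rule, typo_rule]))  # {1: ['whitelistEntities']}
```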

In [None]:
contextual_entity_filterer = medical.ContextualEntityFilterer() \
    .setInputCols("document", "token", "ner_chunk_subentity_docwise") \
    .setOutputCol("filtered_ner_chunks") \
    .setRules(rules)\
    .setRuleScope("sentence")  # alternative: "document"

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      tokenizer,
      word_embeddings,
      ner_deid,
      ner_deid_converter,
      contextual_entity_filterer
      ])

In [None]:
text = "NY, a 34-year-old woman, Dr. Michael Johnson cares wit her, at CarePlus Clinic, located at 456 Elm Street, NewYork, NY has recommended starting insulin therapy."
df = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(df).transform(df).cache()

In [None]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------------+---------------------------+--------------------+
|                text|            document|               token|          embeddings|ner_deid_subentity_docwise|ner_chunk_subentity_docwise| filtered_ner_chunks|
+--------------------+--------------------+--------------------+--------------------+--------------------------+---------------------------+--------------------+
|NY, a 34-year-old...|[{document, 0, 15...|[{token, 0, 1, NY...|[{word_embeddings...|      [{named_entity, 0...|       [{chunk, 0, 1, NY...|[{chunk, 6, 16, 3...|
+--------------------+--------------------+--------------------+--------------------+--------------------------+---------------------------+--------------------+



In [None]:
result.selectExpr("explode(ner_chunk_subentity_docwise) as ner_chunk").show(50,truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk                                                                                                                                              |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 1, NY, {chunk -> 0, confidence -> 0.9299, ner_source -> ner_chunk_subentity_docwise, entity -> STATE, sentence -> 0}, []}                   |
|{chunk, 6, 16, 34-year-old, {chunk -> 1, confidence -> 0.7687, ner_source -> ner_chunk_subentity_docwise, entity -> AGE, sentence -> 0}, []}           |
|{chunk, 29, 43, Michael Johnson, {chunk -> 2, confidence -> 0.89965, ner_source -> ner_chunk_subentity_docwise, entity -> DOCTOR, sentence -> 0}, []}  |
|{chunk, 63, 77, CarePlus Clinic, {chunk -> 3, confidence -> 0.9661, ner_sou

In [None]:
result.selectExpr("explode(filtered_ner_chunks) as filtered_chunks").show(50,truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered_chunks                                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 6, 16, 34-year-old, {chunk -> 1, confidence -> 0.7687, ner_source -> ner_chunk_subentity_docwise, entity -> AGE, sentence -> 0}, []}           |
|{chunk, 29, 43, Michael Johnson, {chunk -> 2, confidence -> 0.89965, ner_source -> ner_chunk_subentity_docwise, entity -> DOCTOR, sentence -> 0}, []}  |
|{chunk, 63, 77, CarePlus Clinic, {chunk -> 3, confidence -> 0.9661, ner_source -> ner_chunk_subentity_docwise, entity -> HOSPITAL, sentence -> 0}, []} |
|{chunk, 91, 104, 456 Elm Street, {chunk -> 4, confidence -> 0.7733667, ner_

In [None]:
flattener = medical.Flattener()\
    .setInputCols("ner_chunk_subentity_docwise") \
    .setExplodeSelectedFields({"ner_chunk_subentity_docwise": ["result as chunk",
                                                                "begin as begin",
                                                                "end as end",
                                                                "metadata.entity as ner_label",
                                                                "metadata.confidence as confidence"]})

In [None]:
flattener.transform(result).show(truncate=False)

+---------------+-----+---+---------+----------+
|chunk          |begin|end|ner_label|confidence|
+---------------+-----+---+---------+----------+
|NY             |0    |1  |STATE    |0.9299    |
|34-year-old    |6    |16 |AGE      |0.7687    |
|Michael Johnson|29   |43 |DOCTOR   |0.89965   |
|CarePlus Clinic|63   |77 |HOSPITAL |0.9661    |
|456 Elm Street |91   |104|STREET   |0.7733667 |
|NewYork        |107  |113|CITY     |0.9302    |
|NY             |116  |117|STATE    |0.9991    |
+---------------+-----+---+---------+----------+
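The `begin` and `end` columns are character offsets into the input text, with `end` inclusive. A quick plain-Python check, using the input text and the rows printed above, confirms this; the variable names are illustrative only.

```python
text = ("NY, a 34-year-old woman, Dr. Michael Johnson cares wit her, "
        "at CarePlus Clinic, located at 456 Elm Street, NewYork, NY "
        "has recommended starting insulin therapy.")

# (chunk, begin, end) copied from the flattened table above
rows = [
    ("NY", 0, 1),
    ("34-year-old", 6, 16),
    ("Michael Johnson", 29, 43),
    ("CarePlus Clinic", 63, 77),
    ("456 Elm Street", 91, 104),
    ("NewYork", 107, 113),
    ("NY", 116, 117),
]

# `end` is inclusive, so the Python slice needs end + 1.
mismatches = [(chunk, begin, end) for chunk, begin, end in rows
              if text[begin:end + 1] != chunk]
print(mismatches)  # []
```

The offsets are also what distinguishes the two `NY` chunks, which share the same text and label.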



In [None]:
flattener = medical.Flattener()\
    .setInputCols("filtered_ner_chunks") \
    .setExplodeSelectedFields({"filtered_ner_chunks": ["result as chunk",
                                                       "begin as begin",
                                                       "end as end",
                                                       "metadata.entity as ner_label",
                                                       "metadata.confidence as confidence"]})
flattener.transform(result).show(truncate=False)

+---------------+-----+---+---------+----------+
|chunk          |begin|end|ner_label|confidence|
+---------------+-----+---+---------+----------+
|34-year-old    |6    |16 |AGE      |0.7687    |
|Michael Johnson|29   |43 |DOCTOR   |0.89965   |
|CarePlus Clinic|63   |77 |HOSPITAL |0.9661    |
|456 Elm Street |91   |104|STREET   |0.7733667 |
|NewYork        |107  |113|CITY     |0.9302    |
|NY             |116  |117|STATE    |0.9991    |
+---------------+-----+---+---------+----------+
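Comparing the two tables shows what the rule removed. The sketch below does the comparison in plain Python on rows copied from the outputs above (confidence omitted); on the Spark DataFrames themselves a join on `begin` and `end` would serve the same purpose, but that variant is not shown here.

```python
# (chunk, begin, end, ner_label) rows from the unfiltered output
before = [
    ("NY", 0, 1, "STATE"),
    ("34-year-old", 6, 16, "AGE"),
    ("Michael Johnson", 29, 43, "DOCTOR"),
    ("CarePlus Clinic", 63, 77, "HOSPITAL"),
    ("456 Elm Street", 91, 104, "STREET"),
    ("NewYork", 107, 113, "CITY"),
    ("NY", 116, 117, "STATE"),
]

# Rows from the filtered output
after = [
    ("34-year-old", 6, 16, "AGE"),
    ("Michael Johnson", 29, 43, "DOCTOR"),
    ("CarePlus Clinic", 63, 77, "HOSPITAL"),
    ("456 Elm Street", 91, 104, "STREET"),
    ("NewYork", 107, 113, "CITY"),
    ("NY", 116, 117, "STATE"),
]

kept = set(after)
removed = [row for row in before if row not in kept]
print(removed)  # [('NY', 0, 1, 'STATE')]
```

Only the first `NY` is dropped: no `CITY` chunk lies within two tokens of it, whereas the second `NY` follows `NewYork` with just a comma in between.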

