![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/AssertionFilterer.ipynb)

# **AssertionFilterer**

This notebook covers the parameters and usage of `AssertionFilterer`. This annotator filters the chunks extracted by an NER model according to the assertion status assigned to them by an assertion model such as `AssertionDLModel`.

**📖 Learning Objectives:**

1. Understand the meaning and use of assertion status.

2. Learn how to build a pipeline that outputs only the chunks with the assertion statuses you want.

3. Customize the filtering behavior by using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [AssertionFilterer](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#assertionfilterer)

- Python Docs : [AssertionFilterer](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/assertion_filterer/)

- Scala Docs : [AssertionFilterer](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/chunker/AssertionFilterer.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb).

## **📜 Background**


The goal of assertion models is to classify chunks of text according to their context. The typical example of assertion status detection is negation identification: in the sentence "the patient has no history of diabetes", the chunk "diabetes" (extracted by a clinical NER model as a Disease) would be classified as Absent by an assertion model because of the word "no" in its context. A more complex assertion model can include other labels such as Hypothetical, Past, Planned, Possible, Family, etc.

The deep neural network architecture for assertion status detection in Spark NLP is based on a Bi-LSTM framework, and is a modified version of the architecture proposed by Federico Fancellu, Adam Lopez and Bonnie Webber in *Neural Networks For Negation Scope Detection*.


**AssertionFilterer** filters named entity chunks by a list of acceptable assertion statuses. It is handy when you want to whitelist statuses such as present or conditional and keep absent conditions from passing through your pipeline.

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False 

In [None]:
import pandas as pd

spark = nlp.start()
spark

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`, `ASSERTION`

- Output: `CHUNK`

## **Parameters**


- `whiteList`: (list) If defined, list of values to keep (assertion statuses by default; see `criteria`). Chunks that do not match are dropped.

- `blackList`: (list) If defined, list of values to exclude. Matching chunks are dropped and the rest are kept.

- `caseSensitive`: (bool) Determines whether the whitelisted and blacklisted values are matched case-sensitively.

- `regex`: (list) List of regular expressions matched against the chunks when `criteria` is `regex`.

- `criteria`: (str) Criterion used to filter the chunks. Possible values: `assertion` (filter by the assertion status), `isin` (filter by the chunk text), `regex` (filter by a regex). Default: `assertion`.

- `entitiesConfidence`: (str) Entity pairs to remove based on the confidence level.

- `doExceptionHandling`: (bool) If true, exceptions are handled instead of stopping the process. Set with `setDoExceptionHandling(True)`.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence", "ner_chunk", "assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    #.setWhiteList(["Present", "Planned", "Possible"])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

assertionFilter_model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
assertion_jsl_augmented download started this may take some time.
[OK!]


In [None]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain.'

light_model = nlp.LightPipeline(assertionFilter_model)
light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['assertion_filtered', 'document', 'ner_chunk', 'assertion', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
list(zip(light_result['ner_chunk'], light_result['assertion']))

[('a headache', 'Present'),
 ('a head CT', 'Hypothetical'),
 ('anxious', 'Possible'),
 ('Alopecia', 'Present'),
 ('pain', 'Absent')]

In [None]:
chunks=[]
entities=[]
status=[]
confidence=[]

light_result = light_model.fullAnnotate(text)[0]

for m in light_result['assertion_filtered']:

    chunks.append(m.result)
    entities.append(m.metadata['entity'])
    status.append(m.metadata['assertion'])
    confidence.append(m.metadata['confidence'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

| | chunks | entities | assertion | confidence |
|---|---|---|---|---|
| 0 | a headache | PROBLEM | Present | 0.97150004 |
| 1 | a head CT | TEST | Hypothetical | 0.8149 |
| 2 | anxious | PROBLEM | Possible | 0.9769 |
| 3 | Alopecia | PROBLEM | Present | 0.9949 |
| 4 | pain | PROBLEM | Absent | 0.9958 |
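The loop above is repeated for every configuration in this notebook. As a convenience, the same four fields could be collected by a small helper like the sketch below. `annotations_to_rows` is a hypothetical name, not part of the library, and the sketch assumes only that each annotation exposes `result` and a `metadata` dict with the keys used above. The `SimpleNamespace` objects are stand-ins so that the snippet runs without a Spark session.

```python
from types import SimpleNamespace


def annotations_to_rows(annotations):
    """Collect chunk text, entity, assertion and confidence from annotations.

    Hypothetical helper: assumes each item has `.result` and a `.metadata`
    dict with 'entity', 'assertion' and 'confidence' keys.
    """
    return [
        {
            "chunks": a.result,
            "entities": a.metadata["entity"],
            "assertion": a.metadata["assertion"],
            "confidence": a.metadata["confidence"],
        }
        for a in annotations
    ]


# Stand-in annotations mirroring two rows of the output above.
demo = [
    SimpleNamespace(
        result="a headache",
        metadata={"entity": "PROBLEM", "assertion": "Present", "confidence": "0.97150004"},
    ),
    SimpleNamespace(
        result="pain",
        metadata={"entity": "PROBLEM", "assertion": "Absent", "confidence": "0.9958"},
    ),
]

rows = annotations_to_rows(demo)
print(rows)
```

With real results, passing `annotations_to_rows(light_result['assertion_filtered'])` to `pd.DataFrame` should give the same table as the loop.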


## setWhiteList()

In [None]:
assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence", "ner_chunk", "assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(True)\
    .setWhiteList(["Present"])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

assertionFilter_model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = nlp.LightPipeline(assertionFilter_model)
light_result = light_model.fullAnnotate(text)[0]

In [None]:
chunks=[]
entities=[]
status=[]
confidence=[]

for m in light_result['assertion_filtered']:

    chunks.append(m.result)
    entities.append(m.metadata['entity'])
    status.append(m.metadata['assertion'])
    confidence.append(m.metadata['confidence'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

| | chunks | entities | assertion | confidence |
|---|---|---|---|---|
| 0 | a headache | PROBLEM | Present | 0.97150004 |
| 1 | Alopecia | PROBLEM | Present | 0.9949 |


As you can see, the chunks "pain", "a head CT" and "anxious" are missing from the output, because the whitelist contains only the "Present" assertion label.
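The effect of the whitelist can be reproduced in plain Python. The sketch below is a conceptual model only, not the annotator's implementation: `filter_by_whitelist` is a hypothetical function, and the pairs are copied from the unfiltered output earlier in this notebook.

```python
def filter_by_whitelist(pairs, whitelist):
    """Keep only (chunk, status) pairs whose status is in the whitelist."""
    allowed = set(whitelist)
    return [(chunk, status) for chunk, status in pairs if status in allowed]


# (chunk, assertion) pairs from the unfiltered pipeline output.
pairs = [
    ("a headache", "Present"),
    ("a head CT", "Hypothetical"),
    ("anxious", "Possible"),
    ("Alopecia", "Present"),
    ("pain", "Absent"),
]

kept = filter_by_whitelist(pairs, ["Present"])
print(kept)  # [('a headache', 'Present'), ('Alopecia', 'Present')]
```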





## setBlackList()

In [None]:
assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(True)\
    .setBlackList(["Possible"])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

assertionFilter_model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = nlp.LightPipeline(assertionFilter_model)
light_result = light_model.fullAnnotate(text)[0]


In [None]:
chunks=[]
entities=[]
status=[]
confidence=[]

for m in light_result['assertion_filtered']:

    chunks.append(m.result)
    entities.append(m.metadata['entity'])
    status.append(m.metadata['assertion'])
    confidence.append(m.metadata['confidence'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

| | chunks | entities | assertion | confidence |
|---|---|---|---|---|
| 0 | a headache | PROBLEM | Present | 0.97150004 |
| 1 | a head CT | TEST | Hypothetical | 0.8149 |
| 2 | Alopecia | PROBLEM | Present | 0.9949 |
| 3 | pain | PROBLEM | Absent | 0.9958 |


As you can see, the chunk with the 'Possible' assertion label ("anxious") is filtered out, and all other chunks are kept.
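A blacklist is the mirror image of a whitelist: everything passes except the listed statuses. The plain-Python sketch below illustrates this with a hypothetical `filter_by_blacklist` function; it models the behavior seen in the output and makes no claim about how the annotator is implemented.

```python
def filter_by_blacklist(pairs, blacklist):
    """Drop (chunk, status) pairs whose status is in the blacklist."""
    blocked = set(blacklist)
    return [(chunk, status) for chunk, status in pairs if status not in blocked]


# (chunk, assertion) pairs from the unfiltered pipeline output.
pairs = [
    ("a headache", "Present"),
    ("a head CT", "Hypothetical"),
    ("anxious", "Possible"),
    ("Alopecia", "Present"),
    ("pain", "Absent"),
]

remaining = filter_by_blacklist(pairs, ["Possible"])
print([chunk for chunk, _ in remaining])
# ['a headache', 'a head CT', 'Alopecia', 'pain']
```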




## setCaseSensitive()

In [None]:
assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence", "ner_chunk", "assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(True)\
    .setWhiteList(["PRESENT"])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

assertionFilter_model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = nlp.LightPipeline(assertionFilter_model)
light_result = light_model.fullAnnotate(text)[0]

In [None]:
chunks=[]
entities=[]
status=[]
confidence=[]

for m in light_result['assertion_filtered']:

    chunks.append(m.result)
    entities.append(m.metadata['entity'])
    status.append(m.metadata['assertion'])
    confidence.append(m.metadata['confidence'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

| | chunks | entities | assertion | confidence |
|---|---|---|---|---|


As the empty result shows, with **setCaseSensitive(True)** the whitelist entry "PRESENT" does not match the label "Present" produced by the assertion model, so every chunk is filtered out. For more flexible matching, leave **setCaseSensitive()** at its default value of False.
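The difference between the two settings comes down to how the label is compared with the whitelist. The sketch below shows one plausible way to model it, assuming that case-insensitive matching lowercases both sides; `status_allowed` is a hypothetical function and the lowercasing rule is an assumption, not a description of the library's internals.

```python
def status_allowed(status, whitelist, case_sensitive):
    """Return True if `status` matches any whitelist entry."""
    if case_sensitive:
        return status in whitelist
    # Assumption: case-insensitive matching compares lowercased strings.
    return status.lower() in {entry.lower() for entry in whitelist}


print(status_allowed("Present", ["PRESENT"], case_sensitive=True))   # False
print(status_allowed("Present", ["PRESENT"], case_sensitive=False))  # True
```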

## .setCriteria("isin")

`setCriteria()` selects which part of each annotation the whitelisted and blacklisted values (or the regex patterns) are compared against. Possible values are:

- `assertion`: filter by the assertion status (default)

- `isin`: filter by the chunk text

- `regex`: filter the chunk text by using a regex

You can find the use cases below.
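Before looking at the pipeline examples, the three criteria can be pictured as a single decision function. The sketch below is illustrative only: `keep_chunk` is a hypothetical name, and the matching rules it uses (case-insensitive equality, and `re.fullmatch` for regex) are assumptions rather than the annotator's documented behavior.

```python
import re


def keep_chunk(chunk, status, criteria, whitelist=(), patterns=()):
    """Decide whether a chunk survives filtering under a given criterion.

    The criterion names mirror the notebook; the matching rules are assumed.
    """
    if criteria == "assertion":
        # Compare the assertion status with the whitelist.
        return status.lower() in {w.lower() for w in whitelist}
    if criteria == "isin":
        # Compare the chunk text with the whitelist.
        return chunk.lower() in {w.lower() for w in whitelist}
    if criteria == "regex":
        # Keep the chunk if any pattern matches the whole chunk text.
        return any(re.fullmatch(p, chunk) for p in patterns)
    raise ValueError(f"unknown criteria: {criteria!r}")


print(keep_chunk("Alopecia", "Present", "isin", whitelist=["Alopecia", "a headache"]))
print(keep_chunk("gastric problems", "absent", "assertion", whitelist=["absent"]))
print(keep_chunk("sore throat/burning/itching", "present", "regex", patterns=[".*/.*"]))
print(keep_chunk("itchy skin", "present", "regex", patterns=[".*/.*"]))
```

The first three calls return `True` and the last returns `False`, which matches the filtered outputs shown in the sections that follow.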


In [None]:
assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence", "ner_chunk", "assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Alopecia", "a headache"])\
    .setCriteria("isin")


nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

assertionFilter_model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

light_model = nlp.LightPipeline(assertionFilter_model)
light_result = light_model.annotate(text)

In [None]:
light_result['ner_chunk']

['a headache', 'a head CT', 'anxious', 'Alopecia', 'pain']

In [None]:
light_result['assertion_filtered']

['a headache', 'Alopecia']

In [None]:
assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence", "ner_chunk", "assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["gastric problems"])\
    .setCriteria("isin")


nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

assertionFilter_model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

In [None]:
text = 'I feel a bit drowsy & have a little blurred vision, but so far no gastric problems.'

filter_df = spark.createDataFrame([[text]]).toDF("text")

chunk_filter_result = assertionFilter_model.transform(filter_df)

In [None]:
chunk_filter_result.select('ner_chunk.result','assertion_filtered.result').show(truncate=False)

+---------------------------------------------------------+------------------+
|result                                                   |result            |
+---------------------------------------------------------+------------------+
|[a bit drowsy, a little blurred vision, gastric problems]|[gastric problems]|
+---------------------------------------------------------+------------------+



## .setCriteria("assertion")

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

bert_embeddings = nlp.BertEmbeddings.pretrained("biobert_pubmed_base_cased")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ade_ner_bert = medical.NerModel.pretrained("ner_ade_biobert", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

biobert_assertion = medical.AssertionDLModel.pretrained("assertion_dl_biobert", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence", "ner_chunk", "assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["absent"])\
    .setCriteria("assertion")

assertion_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    bert_embeddings,
    ade_ner_bert,
    ner_converter,
    biobert_assertion,
    assertion_filterer])

assertionFilter_model = assertion_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = nlp.LightPipeline(assertionFilter_model)

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
ner_ade_biobert download started this may take some time.
[OK!]
assertion_dl_biobert download started this may take some time.
[OK!]


In [None]:
text = 'I feel a bit drowsy & have a little blurred vision, but so far no gastric problems.'
light_result = light_model.annotate(text)

list(zip(light_result['ner_chunk'], light_result['assertion']))

[('drowsy', 'present'),
 ('blurred vision', 'present'),
 ('gastric problems', 'absent')]

In [None]:
light_result["assertion_filtered"]

['gastric problems']

## .setCriteria("regex") and .setRegex()

In [None]:
assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence", "ner_chunk", "assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setCriteria("regex")\
    .setRegex([".*/.*"])

assertion_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    bert_embeddings,
    ade_ner_bert,
    ner_converter,
    biobert_assertion,
    assertion_filterer])

assertionFilter_model = assertion_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
light_model = nlp.LightPipeline(assertionFilter_model)

In [None]:
text = "I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums. I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."

light_result = light_model.annotate(text)

In [None]:
list(zip(light_result['ner_chunk'], light_result['assertion']))

[('allergic reaction', 'present'),
 ('vancomycin', 'present'),
 ('itchy skin', 'present'),
 ('sore throat/burning/itching', 'present'),
 ('numbness of tongue and gums', 'present'),
 ('any other medication', 'present')]

In [None]:
light_result["ner_chunk"]

['allergic reaction',
 'vancomycin',
 'itchy skin',
 'sore throat/burning/itching',
 'numbness of tongue and gums',
 'any other medication']

In [None]:
light_result["assertion_filtered"]

['sore throat/burning/itching']

## .setDoExceptionHandling(True)

The `doExceptionHandling` parameter is designed for annotators to ensure robust exception handling in case the process is interrupted due to corrupted inputs. When enabled, the annotator attempts to process the data as usual. If exception-causing data (e.g., a corrupted record or document) is encountered, an exception warning is emitted with the relevant exception message, while the rest of the records within the same batch are processed without interruption. By default, this parameter is set to `False`, meaning the process will throw an exception and halt to inform users of the issue.









```
assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence", "ner_chunk", "assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setDoExceptionHandling(True)
```
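The behavior described above (warn on a bad record, keep processing the rest) can be sketched in plain Python. This is a conceptual model under stated assumptions, not the library's code: `process_batch` and `toy_annotate` are hypothetical names, and returning `None` for a failed record is an assumption made only for illustration.

```python
import warnings


def process_batch(records, annotate, do_exception_handling=False):
    """Apply `annotate` to each record, optionally surviving bad records."""
    results = []
    for record in records:
        try:
            results.append(annotate(record))
        except Exception as exc:
            if not do_exception_handling:
                raise  # Default behavior: halt and surface the error.
            warnings.warn(f"Skipping record {record!r}: {exc}")
            results.append(None)  # Assumed placeholder for the failed record.
    return results


def toy_annotate(text):
    """Stand-in for an annotator; fails on non-string input."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    return text.upper()


batch = ["She denies pain.", None, "Alopecia noted."]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = process_batch(batch, toy_annotate, do_exception_handling=True)

print(out)          # ['SHE DENIES PAIN.', None, 'ALOPECIA NOTED.']
print(len(caught))  # 1
```

With `do_exception_handling=False`, the same batch raises `TypeError` at the second record and nothing after it is processed.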

