![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.3.Contextual_Assertion.ipynb)

# 📜Contextual Assertion



This annotator identifies contextual cues in text, such as negation and uncertainty, and is used for tasks
like clinical assertion detection. It annotates text chunks with assertions based on configurable rules:
prefix and suffix patterns, and exception patterns.



## **🎬 Colab Setup**

In [None]:

import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.0  spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install --upgrade -q spark-nlp-display

In [None]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.common import *
from sparknlp.training import *
from sparknlp.annotator import *

from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *
from sparknlp_jsl.annotator.flattener import *

from pyspark.ml import Pipeline
from pyspark.sql.types import StringType
import pyspark.sql.types as T
import pyspark.sql.functions as F

import pandas as pd
import numpy as np

spark = sparknlp_jsl.start(license_keys['SECRET'] #gpu=True
                           )
print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.4.0
Spark NLP_JSL Version : 5.4.0




## **🖨️ Input/Output Annotation Types**

- Input: `SENTENCE`, `TOKEN`, `CHUNK`

- Output: `ASSERTION`

## **🔎 Parameters**

**Parameters**:

- `inputCols`: Input annotation columns.
- `caseSensitive`: Whether keyword matching is case sensitive. Defaults to `False`.
- `prefixAndSuffixMatch`: Whether both a prefix and a suffix must match for the chunk to be annotated.
- `prefixKeywords`: Prefix keywords to match.
- `suffixKeywords`: Suffix keywords to match.
- `exceptionKeywords`: Exception keywords that must not be matched.
- `prefixRegexPatterns`: Prefix regex patterns to match.
- `suffixRegexPatterns`: Suffix regex patterns to match.
- `exceptionRegexPatterns`: Exception regex patterns that must not be matched.
- `scopeWindow`: The scope window of the assertion expression, given as the number of tokens before and after the chunk.
- `assertion`: The assertion label to assign when a match is found.
- `includeChunkToScope`: Whether to include the chunk itself in the scope when matching.
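
To make the interplay of these parameters more concrete, here is a minimal pure-Python sketch of keyword matching around a chunk. It is an illustration only and not the `ContextualAssertion` implementation: the helper `toy_assertion` and its argument names are hypothetical, and the assumption that an exception keyword anywhere in the sentence suppresses the hit may not reflect how the annotator actually applies exceptions.

```python
import re


def toy_assertion(sentence, chunk, prefix_keywords=(), suffix_keywords=(),
                  exception_keywords=(), prefix_and_suffix_match=False,
                  case_sensitive=False, assertion="absent"):
    """Return `assertion` if a keyword is found around `chunk`, else None.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    start = sentence.find(chunk)
    if start < 0:
        return None
    before, after = sentence[:start], sentence[start + len(chunk):]
    flags = 0 if case_sensitive else re.IGNORECASE

    def has_any(keywords, text):
        # Whole-word search so that "no" does not match inside "normal".
        return any(re.search(r"\b" + re.escape(k) + r"\b", text, flags)
                   for k in keywords)

    # Assumption: an exception keyword in the sentence suppresses the hit.
    if has_any(exception_keywords, sentence):
        return None

    prefix_hit = has_any(prefix_keywords, before)
    suffix_hit = has_any(suffix_keywords, after)
    if prefix_and_suffix_match:
        hit = prefix_hit and suffix_hit
    else:
        hit = prefix_hit or suffix_hit
    return assertion if hit else None


print(toy_assertion("Patient denies nausea at this time.", "nausea",
                    prefix_keywords=["denies"]))
print(toy_assertion("Alcoholism unlikely.", "Alcoholism",
                    suffix_keywords=["unlikely"]))
print(toy_assertion("Patient has headache and fever.", "headache",
                    prefix_keywords=["no", "not"]))
```

The first two calls print `absent` and the third prints `None`, since neither "no" nor "not" appears before "headache".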

## PIPELINE

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel \
    .pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


## GENERAL USAGE

To configure assertions using specific keywords and regex patterns, you can use the methods `setPrefixKeywords`, `setSuffixKeywords`, `setPrefixRegexPatterns`, and `setSuffixRegexPatterns`.

If you want to exclude certain keywords and patterns from being detected, use `setExceptionKeywords` and `setExceptionRegexPatterns`.

To add to the default keywords and regex patterns, or to include pretrained model keywords and patterns, use the methods `addPrefixKeywords` and `addSuffixKeywords`.

If case sensitivity matters for the keywords being searched, enable it by calling `setCaseSensitive(True)`.

To require a match both before and after the chunk, call `setPrefixAndSuffixMatch(True)`.

You can define the name of the assertion being searched for using the `setAssertion` method.

To specify the number of tokens before and after the chunk within which keywords and regex patterns should be searched, use the `setScopeWindow` method. By default, the search is conducted throughout the entire sentence.

To include the chunk itself in the keyword and regex search, use the `setIncludeChunkToScope` method.
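
The effect of the scope window and of including the chunk in the scope can be sketched in plain Python. This is a rough illustration under stated assumptions, not the annotator's code: `scope_tokens` is a hypothetical helper, tokens are split on whitespace rather than by the pipeline's `Tokenizer`, and `None` is used here to mean "search the whole sentence", which may differ from how the library encodes its default.

```python
def scope_tokens(tokens, chunk_start, chunk_end, scope_window=None,
                 include_chunk=False):
    """Return the (left, right) token lists searched around a chunk.

    tokens: token strings of one sentence.
    chunk_start, chunk_end: inclusive token indices of the chunk.
    scope_window: (tokens before, tokens after), or None for no limit.
    Hypothetical helper for illustration; not part of Spark NLP.
    """
    if scope_window is None:
        left_from, right_to = 0, len(tokens)
    else:
        before, after = scope_window
        left_from = max(0, chunk_start - before)
        right_to = min(len(tokens), chunk_end + 1 + after)

    left = tokens[left_from:chunk_start]
    right = tokens[chunk_end + 1:right_to]
    if include_chunk:
        # The chunk tokens become part of both search scopes.
        chunk = tokens[chunk_start:chunk_end + 1]
        left, right = left + chunk, chunk + right
    return left, right


tokens = "Patient given azithromycin without any difficulty .".split()
# The chunk "any difficulty" covers token indices 4 and 5.
print(scope_tokens(tokens, 4, 5, scope_window=(2, 2)))
print(scope_tokens(tokens, 4, 5, scope_window=(1, 1), include_chunk=True))
```

With a window of `(2, 2)` only "azithromycin" and "without" are searched on the left, so a cue placed earlier in the sentence would be missed; widening the window trades precision for recall.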


In [None]:
text = """Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, states chest tightness.
     No evidence of hypertension. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating
     associated with pneumonia. Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine
     once used in the last year. Alcoholism unlikely. Patient has headache and fever. Patient is not diabetic. Not clearly of diarrhea.
     Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. Patient also has a history of cardiac injury.
     No kidney injury reported. No abnormal rashes or ulcers. Patient might not have liver disease. Confirmed absence of hemoptysis.
     Although patient has severe pneumonia and fever, test reports are negative for COVID-19 infection. COVID-19 viral infection absent.
    """
data = spark.createDataFrame([[text]]).toDF("text")

In [None]:
contextual_assertion = ContextualAssertion() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("assertion") \
    .setPrefixKeywords(["no", "not"]) \
    .setSuffixKeywords(["unlikely","negative","no"]) \
    .setPrefixRegexPatterns(["\\b(no|without|denies|never|none|free of|not include)\\b"]) \
    .setSuffixRegexPatterns(["\\b(free of|negative for|absence of|not|rule out)\\b"]) \
    .setExceptionKeywords(["without"]) \
    .setExceptionRegexPatterns(["\\b(not clearly)\\b"]) \
    .addPrefixKeywords(["negative for","negative"]) \
    .addSuffixKeywords(["absent","neither"]) \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False) \
    .setAssertion("absent") \
    .setScopeWindow([2, 2]) \
    .setIncludeChunkToScope(True)

# Flatten the assertion annotations into one row per annotation
flattener = Flattener() \
    .setInputCols("assertion")


pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextual_assertion,
        flattener
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

result = model.transform(data)
result.show(truncate=False)


+----------------+---------------+-------------+-----------------------------------+------------------------+----------------------------+-----------------------------+----------------------------+---------------------------+
|assertion_result|assertion_begin|assertion_end|assertion_metadata_assertion_source|assertion_metadata_chunk|assertion_metadata_ner_chunk|assertion_metadata_confidence|assertion_metadata_ner_label|assertion_metadata_sentence|
+----------------+---------------+-------------+-----------------------------------+------------------------+----------------------------+-----------------------------+----------------------------+---------------------------+
|absent          |178            |183          |assertion                          |5                       |nausea                      |0.50                         |PROBLEM                     |4                          |
|absent          |428            |437          |assertion                          |11          

## NEGEX (DEFAULT)

In [None]:
text = """Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, states chest tightness.
     No evidence of hypertension. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating
     associated with pneumonia. Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine
     once used in the last year. Alcoholism unlikely. Patient has headache and fever. Patient is not diabetic. Not clearly of diarrhea.
     Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. Patient also has a history of cardiac injury.
     No kidney injury reported. No abnormal rashes or ulcers. Patient might not have liver disease. Confirmed absence of hemoptysis.
     Although patient has severe pneumonia and fever, test reports are negative for COVID-19 infection. COVID-19 viral infection absent.
    """
data = spark.createDataFrame([[text]]).toDF("text")

In [None]:
# No keywords or patterns are set, so the annotator's defaults are used
contextual_assertion = ContextualAssertion() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("assertion")


pipeline = Pipeline(
    stages=[
          document_assembler,
          sentence_detector,
          tokenizer,
          word_embeddings,
          clinical_ner,
          ner_converter,
          contextual_assertion,
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

result = model.transform(data)


In [None]:
from sparknlp_display import AssertionVisualizer

vis = AssertionVisualizer()

vis.display(result.collect()[0], 'ner_chunk', 'assertion')

In [None]:
flattener = Flattener() \
    .setInputCols("assertion") \
    .setExplodeSelectedFields({"assertion":["metadata.ner_chunk as ner_chunk",
                                            "begin as begin",
                                            "end as end",
                                            "metadata.ner_label as ner_label",
                                            "result"]})
pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextual_assertion,
        flattener
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

result = model.transform(data)
result.show(truncate=False)

+------------------+-----+---+---------+----------------+
|ner_chunk         |begin|end|ner_label|assertion_result|
+------------------+-----+---+---------+----------------+
|any difficulty    |59   |72 |PROBLEM  |absent          |
|hypertension      |149  |160|PROBLEM  |absent          |
|nausea            |178  |183|PROBLEM  |absent          |
|zofran            |199  |204|TREATMENT|absent          |
|pain              |309  |312|PROBLEM  |absent          |
|tylenol           |318  |324|TREATMENT|absent          |
|Alcoholism        |428  |437|PROBLEM  |absent          |
|diabetic          |496  |503|PROBLEM  |absent          |
|kidney injury     |664  |676|PROBLEM  |absent          |
|abnormal rashes   |691  |705|PROBLEM  |absent          |
|ulcers            |710  |715|PROBLEM  |absent          |
|liver disease     |741  |753|PROBLEM  |absent          |
|hemoptysis        |777  |786|PROBLEM  |absent          |
|COVID-19 infection|873  |890|PROBLEM  |absent          |
|viral infecti

## DATE

In [None]:
dateText = """Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, states chest tightness.
      No evidence of hypertension. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating
      associated with pneumonia. Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine
      once used in the last year. Alcoholism detected in 2022-01-15. Patient has headache and fever. Patient diabetic in the past. Not clearly of diarrhea.
      Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. Patient also has a history of cardiac injury.
      Kidney injury will be reported in 2024-01-15. No abnormal rashes or ulcers. Patient have liver disease before. Confirmed absence of hemoptysis.
      Although patient has severe pneumonia and fever, test reports are negative for COVID-19 infection. COVID-19 viral infection can detect in the future.
      """
# Raw string so that \s and \d reach the regex engine without Python escape warnings
pattern = r'(?:yesterday|last\s(?:night|week|month|year|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)|(?:a|an|one|two|three|four|five|six|seven|eight|nine|ten|several|many|few)\s(?:days?|weeks?|months?|years?)\sago|in\s(?:\d{4}|\d{1,2}\s(?:January|February|March|April|May|June|July|August|September|October|November|December)))'

data = spark.createDataFrame([[dateText]]).toDF("text")
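
Before wiring the pattern into the pipeline, it can help to check which phrases it matches. The sketch below uses Python's `re` module as a stand-in; the annotator evaluates patterns on the JVM, so treating the two engines as equivalent for this pattern is an assumption. The sample sentences are taken from `dateText`, and `date_regex` is a name introduced here for illustration.

```python
import re

# Same pattern as above, repeated so this snippet runs on its own.
pattern = r'(?:yesterday|last\s(?:night|week|month|year|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)|(?:a|an|one|two|three|four|five|six|seven|eight|nine|ten|several|many|few)\s(?:days?|weeks?|months?|years?)\sago|in\s(?:\d{4}|\d{1,2}\s(?:January|February|March|April|May|June|July|August|September|October|November|December)))'
date_regex = re.compile(pattern)

samples = [
    "Alcoholism detected in 2022-01-15.",
    "cocaine once used in the last year.",
    "Patient diabetic in the past.",
    "Kidney injury will be reported in 2024-01-15.",
]

for sentence in samples:
    match = date_regex.search(sentence)
    print(repr(sentence), "->", match.group(0) if match else None)
```

Two things stand out. "in the past" is not matched by the regex, so that sentence relies on the `past` keyword instead. And `in\s\d{4}` matches "in 2024" even though the sentence is in the future tense, because the pattern looks only at the date expression and not at the verb.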

In [None]:
contextual_assertion = ContextualAssertion() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("assertion") \
    .setPrefixKeywords(["before", "past"]) \
    .setSuffixKeywords(["before","past"]) \
    .setPrefixRegexPatterns([pattern]) \
    .setSuffixRegexPatterns([pattern]) \
    .setScopeWindow([5,5])\
    .setAssertion("past")

flattener = Flattener() \
    .setInputCols("assertion") \
    .setExplodeSelectedFields({"assertion":["metadata.ner_chunk as ner_chunk",
                                            "begin as begin",
                                            "end as end",
                                            "metadata.ner_label as ner_label",
                                            "result"]})
pipeline = Pipeline(
    stages=[
          document_assembler,
          sentence_detector,
          tokenizer,
          word_embeddings,
          clinical_ner,
          ner_converter,
          contextual_assertion,
          flattener
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

result = model.transform(data)
result.show(truncate=False)

+-------------+-----+---+---------+----------------+
|ner_chunk    |begin|end|ner_label|assertion_result|
+-------------+-----+---+---------+----------------+
|Alcoholism   |431  |440|PROBLEM  |past            |
|diabetic     |506  |513|PROBLEM  |past            |
|Kidney injury|685  |697|PROBLEM  |past            |
|liver disease|774  |786|PROBLEM  |past            |
+-------------+-----+---+---------+----------------+



## FAMILY

In [None]:
familyText = """Patient has a family history of diabetes. Mother's side has a history of hypertension.
                Father diagnosed with heart disease last year. Sister and brother both have asthma.
                Grandfather had cancer in his late 70s. No known family history of substance abuse.
                Family history of autoimmune diseases is also noted."""

familyPatterns = ["(mother|father|sister|brother|grandfather|grandmother|uncle|aunt|cousin|family)"]
familyKeywords = ["family history", "family member"]

data = spark.createDataFrame([[familyText]]).toDF("text")
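
As a quick sanity check of the family pattern, the sketch below runs it with Python's `re` module, using `re.IGNORECASE` to mirror `setCaseSensitive(False)`. This is only an approximation of what the annotator does; `family_regex` and `bounded_regex` are names introduced here, and the word-boundary variant is a suggestion rather than something the notebook uses.

```python
import re

family_patterns = ["(mother|father|sister|brother|grandfather|grandmother|uncle|aunt|cousin|family)"]
family_regex = re.compile(family_patterns[0], re.IGNORECASE)

sentences = [
    "Mother's side has a history of hypertension.",
    "Sister and brother both have asthma.",
    "Grandfather had cancer in his late 70s.",
    "Patient has headache and fever.",
]

for sentence in sentences:
    hits = [m.group(0) for m in family_regex.finditer(sentence)]
    print(sentence, "->", hits)

# Without word boundaries the pattern also fires inside longer words.
print(family_regex.search("The patient felt haunted by the diagnosis."))

# Possible variant with word boundaries, which avoids that false hit.
bounded_regex = re.compile(r"\b" + family_patterns[0] + r"\b", re.IGNORECASE)
print(bounded_regex.search("The patient felt haunted by the diagnosis."))
```

The unbounded pattern finds "aunt" inside "haunted", while the bounded variant returns `None` for the same sentence.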


In [None]:
contextualAssertionFamily = ContextualAssertion()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("assertionFamily")\
    .setPrefixRegexPatterns(familyPatterns)\
    .setSuffixRegexPatterns(familyPatterns)\
    .setPrefixKeywords(familyKeywords)\
    .setSuffixKeywords(familyKeywords)\
    .setCaseSensitive(False)\
    .setAssertion("family")

flattener = Flattener()\
    .setInputCols("assertionFamily")\
    .setExplodeSelectedFields({"assertionFamily":["metadata.ner_chunk as ner_chunk",
                                            "begin as begin",
                                            "end as end",
                                            "metadata.ner_label as ner_label",
                                            "result"]})

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextualAssertionFamily,
        flattener

    ])


empty_data = spark.createDataFrame([[""]]).toDF("text")


model = pipeline.fit(empty_data)
result = model.transform(data)
result.show(truncate=False)

+-------------------+-----+---+---------+----------------------+
|ner_chunk          |begin|end|ner_label|assertionFamily_result|
+-------------------+-----+---+---------+----------------------+
|diabetes           |32   |39 |PROBLEM  |family                |
|hypertension       |73   |84 |PROBLEM  |family                |
|heart disease      |125  |137|PROBLEM  |family                |
|asthma             |179  |184|PROBLEM  |family                |
|cancer             |219  |224|PROBLEM  |family                |
|substance abuse    |270  |284|PROBLEM  |family                |
|autoimmune diseases|321  |339|PROBLEM  |family                |
+-------------------+-----+---+---------+----------------------+

