![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/02.3.Contextual_Assertion.ipynb)

# 📜Contextual Assertion



This model identifies contextual cues in text, such as negation and uncertainty, and is used for
clinical assertion detection. It annotates text chunks with assertions based on configurable rules:
prefix and suffix patterns, and exception patterns.



## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only



## **🖨️ Input/Output Annotation Types**

- Input: `SENTENCE`, `TOKEN`, `CHUNK`

- Output: `ASSERTION`

## **🔎 Parameters**

**Parameters**:

- `inputCols`: Input annotations.
- `caseSensitive`: Whether keyword matching is case sensitive. Defaults to `False`.
- `prefixAndSuffixMatch`: Whether to match both prefix and suffix to annotate the hit
- `prefixKeywords`: Prefix keywords to match
- `suffixKeywords`: Suffix keywords to match
- `exceptionKeywords`: Exception keywords not to match
- `prefixRegexPatterns`: Prefix regex patterns to match
- `suffixRegexPatterns`: Suffix regex pattern to match
- `exceptionRegexPatterns`: Exception regex pattern not to match
- `scopeWindow`: The scope window of the assertion expression
- `assertion`: Assertion to match
- `includeChunkToScope`: Whether to include the chunk in the scope when matching values
- `setConfidenceCalculationDirection`: Sets the direction (left, right, or both) used for the confidence calculation
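
The interplay of `prefixKeywords`, `suffixKeywords`, `caseSensitive`, and `prefixAndSuffixMatch` can be pictured with a small pure-Python sketch. This is not the library's implementation: the helpers `contains_keyword` and `keyword_hit` are hypothetical, and the sketch assumes that a hit needs both sides to match when `prefixAndSuffixMatch` is true and either side otherwise.

```python
import re

def contains_keyword(text, keywords, case_sensitive):
    """True if any keyword occurs in text as a whole word or phrase."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return any(
        re.search(r"\b" + re.escape(k) + r"\b", text, flags) for k in keywords
    )

def keyword_hit(left_text, right_text, prefix_keywords, suffix_keywords,
                case_sensitive=False, prefix_and_suffix_match=False):
    """Toy rule (assumption): prefix cues are searched before the chunk,
    suffix cues after it."""
    prefix_found = contains_keyword(left_text, prefix_keywords, case_sensitive)
    suffix_found = contains_keyword(right_text, suffix_keywords, case_sensitive)
    if prefix_and_suffix_match:
        return prefix_found and suffix_found
    return prefix_found or suffix_found

# "No evidence of hypertension." -> text to the left of the chunk "hypertension"
left, right = "No evidence of", ""
print(keyword_hit(left, right, ["no", "not"], ["unlikely"]))
print(keyword_hit(left, right, ["no", "not"], ["unlikely"], case_sensitive=True))
print(keyword_hit(left, right, ["no", "not"], ["unlikely"], prefix_and_suffix_match=True))
```

In this toy version the first call finds a prefix cue, the second misses it because `No` and `no` differ in case, and the third fails because there is no suffix cue to pair with the prefix.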

## PIPELINE

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel \
    .pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


### GENERAL USAGE

To configure assertions using specific keywords and regex patterns, you can use the methods `setPrefixKeywords`, `setSuffixKeywords`, `setPrefixRegexPatterns`, and `setSuffixRegexPatterns`.

If you want to exclude certain keywords and patterns from being detected, use `setExceptionKeywords` and `setExceptionRegexPatterns`.
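
As an illustration of how an exception can veto a match, the sketch below checks an exception regex before accepting a prefix cue. The helper `is_negated` and both compiled patterns are hypothetical; how the annotator applies exceptions internally is not shown in this notebook.

```python
import re

# Illustrative patterns only; the exception mirrors the "not clearly" example used later.
PREFIX_PATTERN = re.compile(r"\b(no|not|without|denies)\b", re.IGNORECASE)
EXCEPTION_PATTERN = re.compile(r"\b(not clearly)\b", re.IGNORECASE)

def is_negated(left_context):
    """Return True when a prefix cue is present and no exception pattern vetoes it."""
    if EXCEPTION_PATTERN.search(left_context):
        return False
    return PREFIX_PATTERN.search(left_context) is not None

print(is_negated("Patient is not"))   # context before "diabetic"
print(is_negated("Not clearly of"))   # context before "diarrhea"
```

The second context contains the cue `not`, but the exception pattern matches first, so the toy rule does not fire.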

To add to the default keywords and regex patterns, or to include pretrained model keywords and patterns, use the methods `addPrefixKeywords` and `addSuffixKeywords`.

If case sensitivity is important for the keywords being searched, you can enable this by setting `setCaseSensitive` to `True`.

To search for matches at both the beginning and end of chunks, activate this feature by setting `setPrefixAndSuffixMatch` to `True`.

You can define the name of the assertion being searched for using the `setAssertion` method.

To specify the number of tokens before and after the chunk within which keywords and regex patterns should be searched, use the `setScopeWindow` method. By default, the search is conducted throughout the entire sentence.

To include the chunk itself in the keyword and regex search, use the `setIncludeChunkToScope` method.
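
The effect of `setScopeWindow` and `setIncludeChunkToScope` can be sketched on a plain token list. The function `scope_tokens` is a hypothetical helper, not part of the library. It assumes the two window values count tokens to the left and right of the chunk, uses `None` to stand in for the whole-sentence default, and assumes that including the chunk simply adds its tokens to both search scopes.

```python
def scope_tokens(tokens, chunk_start, chunk_end, window=None, include_chunk=False):
    """Return the (left, right) token lists searched around tokens[chunk_start:chunk_end]."""
    if window is None:
        # Whole sentence: everything before and after the chunk.
        left, right = tokens[:chunk_start], tokens[chunk_end:]
    else:
        before, after = window
        left = tokens[max(0, chunk_start - before):chunk_start]
        right = tokens[chunk_end:chunk_end + after]
    if include_chunk:
        # Assumption: the chunk tokens become part of both scopes.
        chunk = tokens[chunk_start:chunk_end]
        left, right = left + chunk, chunk + right
    return left, right

tokens = "Patient given azithromycin without any difficulty .".split()
# The chunk "any difficulty" covers tokens 4 and 5 (end index 6 is exclusive).
print(scope_tokens(tokens, 4, 6, window=[2, 2]))
print(scope_tokens(tokens, 4, 6))
```

With a `[2, 2]` window only `azithromycin without` is searched on the left; with no window the whole left side of the sentence is in scope.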


In [None]:
text = """Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, states chest tightness.
     No evidence of hypertension. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating
     associated with pneumonia. Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine
     once used in the last year. Alcoholism unlikely. Patient has headache and fever. Patient is not diabetic. Not clearly of diarrhea.
     Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. Patient also has a history of cardiac injury.
     No kidney injury reported. No abnormal rashes or ulcers. Patient might not have liver disease. Confirmed absence of hemoptysis.
     Although patient has severe pneumonia and fever, test reports are negative for COVID-19 infection. COVID-19 viral infection absent.
    """
data = spark.createDataFrame([[text]]).toDF("text")

In [None]:
contextual_assertion = medical.ContextualAssertion() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("assertion") \
    .setPrefixKeywords(["no", "not"]) \
    .setSuffixKeywords(["unlikely","negative","no"]) \
    .setPrefixRegexPatterns(["\\b(no|without|denies|never|none|free of|not include)\\b"]) \
    .setSuffixRegexPatterns(["\\b(free of|negative for|absence of|not|rule out)\\b"]) \
    .setExceptionKeywords(["without"]) \
    .setExceptionRegexPatterns(["\\b(not clearly)\\b"]) \
    .addPrefixKeywords(["negative for","negative"]) \
    .addSuffixKeywords(["absent","neither"]) \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False) \
    .setAssertion("absent") \
    .setScopeWindow([2, 2]) \
    .setIncludeChunkToScope(True)

flattener = medical.Flattener() \
    .setInputCols("assertion")


pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextual_assertion,
        flattener
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

result = model.transform(data)
result.show(truncate=False)


+----------------+---------------+-------------+-----------------------------------+------------------------+----------------------------+-----------------------------+----------------------------+---------------------------+
|assertion_result|assertion_begin|assertion_end|assertion_metadata_assertion_source|assertion_metadata_chunk|assertion_metadata_ner_chunk|assertion_metadata_confidence|assertion_metadata_ner_label|assertion_metadata_sentence|
+----------------+---------------+-------------+-----------------------------------+------------------------+----------------------------+-----------------------------+----------------------------+---------------------------+
|absent          |178            |183          |assertion                          |5                       |nausea                      |0.8694                       |PROBLEM                     |4                          |
|absent          |428            |437          |assertion                          |11          

### NEGEX (DEFAULT)

In [None]:
text = """Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, states chest tightness.
     No evidence of hypertension. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating
     associated with pneumonia. Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine
     once used in the last year. Alcoholism unlikely. Patient has headache and fever. Patient is not diabetic. Not clearly of diarrhea.
     Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. Patient also has a history of cardiac injury.
     No kidney injury reported. No abnormal rashes or ulcers. Patient might not have liver disease. Confirmed absence of hemoptysis.
     Although patient has severe pneumonia and fever, test reports are negative for COVID-19 infection. COVID-19 viral infection absent.
    """
data = spark.createDataFrame([[text]]).toDF("text")

In [None]:
# No keywords or patterns are set here, so the annotator runs with its defaults.
contextual_assertion = medical.ContextualAssertion() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("assertion")


pipeline = nlp.Pipeline(
    stages=[
          document_assembler,
          sentence_detector,
          tokenizer,
          word_embeddings,
          clinical_ner,
          ner_converter,
          contextual_assertion,
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

result = model.transform(data)


In [None]:
vis = nlp.viz.AssertionVisualizer()

vis.display(result.collect()[0], 'ner_chunk', 'assertion')

In [None]:
flattener = medical.Flattener() \
    .setInputCols("assertion") \
    .setExplodeSelectedFields({"assertion":["metadata.ner_chunk as ner_chunk",
                                            "begin as begin",
                                            "end as end",
                                            "metadata.ner_label as ner_label",
                                            "result"]})
pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextual_assertion,
        flattener
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

result = model.transform(data)
result.show(truncate=False)

+------------------+-----+---+---------+----------------+
|ner_chunk         |begin|end|ner_label|assertion_result|
+------------------+-----+---+---------+----------------+
|any difficulty    |59   |72 |PROBLEM  |absent          |
|hypertension      |149  |160|PROBLEM  |absent          |
|nausea            |178  |183|PROBLEM  |absent          |
|zofran            |199  |204|TREATMENT|absent          |
|pain              |309  |312|PROBLEM  |absent          |
|tylenol           |318  |324|TREATMENT|absent          |
|Alcoholism        |428  |437|PROBLEM  |absent          |
|diabetic          |496  |503|PROBLEM  |absent          |
|kidney injury     |664  |676|PROBLEM  |absent          |
|abnormal rashes   |691  |705|PROBLEM  |absent          |
|ulcers            |710  |715|PROBLEM  |absent          |
|liver disease     |741  |753|PROBLEM  |absent          |
|hemoptysis        |777  |786|PROBLEM  |absent          |
|COVID-19 infection|873  |890|PROBLEM  |absent          |
|viral infecti

### DATE

In [None]:
dateText = """Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, states chest tightness.
      No evidence of hypertension. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating
      associated with pneumonia. Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine
      once used in the last year. Alcoholism detected in 2022-01-15. Patient has headache and fever. Patient diabetic in the past. Not clearly of diarrhea.
      Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. Patient also has a history of cardiac injury.
      Kidney injury will be reported in 2024-01-15. No abnormal rashes or ulcers. Patient have liver disease before. Confirmed absence of hemoptysis.
      Although patient has severe pneumonia and fever, test reports are negative for COVID-19 infection. COVID-19 viral infection can detect in the future.
      """
# Raw string so that \s and \d reach the regex engine without invalid-escape warnings.
pattern = r'(?:yesterday|last\s(?:night|week|month|year|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)|(?:a|an|one|two|three|four|five|six|seven|eight|nine|ten|several|many|few)\s(?:days?|weeks?|months?|years?)\sago|in\s(?:\d{4}|\d{1,2}\s(?:January|February|March|April|May|June|July|August|September|October|November|December)))'

data = spark.createDataFrame([[dateText]]).toDF("text")
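
A quick way to see what the date pattern matches is to try it with Python's `re` module on a few phrases taken from `dateText`. The snippet below redefines the same expression (split across lines for readability) and only exercises the regular expression itself; it says nothing about how the annotator applies it around each chunk.

```python
import re

# Same expression as `pattern`, written as adjacent raw strings.
pattern = (
    r'(?:yesterday'
    r'|last\s(?:night|week|month|year|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)'
    r'|(?:a|an|one|two|three|four|five|six|seven|eight|nine|ten|several|many|few)'
    r'\s(?:days?|weeks?|months?|years?)\sago'
    r'|in\s(?:\d{4}|\d{1,2}\s(?:January|February|March|April|May|June|July|August'
    r'|September|October|November|December)))'
)

samples = [
    "Alcoholism detected in 2022-01-15",
    "once used in the last year",
    "Patient diabetic in the past",
]
for sample in samples:
    match = re.search(pattern, sample)
    print(sample, "->", match.group(0) if match else None)
```

The first two phrases match on `in 2022` and `last year`. The third has no regex match, which is why `past` and `before` are also supplied as plain keywords in the next cell.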

In [None]:
contextual_assertion = medical.ContextualAssertion() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("assertion") \
    .setPrefixKeywords(["before", "past"]) \
    .setSuffixKeywords(["before","past"]) \
    .setPrefixRegexPatterns([pattern]) \
    .setSuffixRegexPatterns([pattern]) \
    .setScopeWindow([5,5])\
    .setAssertion("past")

flattener = medical.Flattener() \
    .setInputCols("assertion") \
    .setExplodeSelectedFields({"assertion":["metadata.ner_chunk as ner_chunk",
                                            "begin as begin",
                                            "end as end",
                                            "metadata.ner_label as ner_label",
                                            "result"]})
pipeline = nlp.Pipeline(
    stages=[
          document_assembler,
          sentence_detector,
          tokenizer,
          word_embeddings,
          clinical_ner,
          ner_converter,
          contextual_assertion,
          flattener
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

result = model.transform(data)
result.show(truncate=False)

+-------------+-----+---+---------+----------------+
|ner_chunk    |begin|end|ner_label|assertion_result|
+-------------+-----+---+---------+----------------+
|Alcoholism   |431  |440|PROBLEM  |past            |
|diabetic     |506  |513|PROBLEM  |past            |
|Kidney injury|685  |697|PROBLEM  |past            |
|liver disease|774  |786|PROBLEM  |past            |
+-------------+-----+---+---------+----------------+



### FAMILY

In [None]:
familyText = """Patient has a family history of diabetes. Mother's side has a history of hypertension.
                Father diagnosed with heart disease last year. Sister and brother both have asthma.
                Grandfather had cancer in his late 70s. No known family history of substance abuse.
                Family history of autoimmune diseases is also noted."""

familyPatterns = ["(mother|father|sister|brother|grandfather|grandmother|uncle|aunt|cousin|family)"]
familyKeywords = ["family history", "family member"]

data = spark.createDataFrame([[familyText]]).toDF("text")
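
To get a feel for what the family pattern captures, it can be tried directly with Python's `re` module. The snippet compiles the same expression with `re.IGNORECASE` as a stand-in for `setCaseSensitive(False)`; that mapping is an assumption made for this illustration.

```python
import re

family_pattern = re.compile(
    r"(mother|father|sister|brother|grandfather|grandmother|uncle|aunt|cousin|family)",
    re.IGNORECASE,
)

sentences = [
    "Sister and brother both have asthma.",
    "Grandfather had cancer in his late 70s.",
    "Patient resting in bed.",
]
for sentence in sentences:
    print(sentence, "->", [m.group(0) for m in family_pattern.finditer(sentence)])

# The pattern has no \b anchors, so in plain `re` it also matches inside longer words.
print(bool(family_pattern.search("a daunting recovery")))
```

If substring hits such as `aunt` inside `daunting` turn out to be a problem on your data, wrapping the group in `\b...\b` is one option to consider.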


In [None]:
contextualAssertionFamily = medical.ContextualAssertion()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("assertionFamily")\
    .setPrefixRegexPatterns(familyPatterns)\
    .setSuffixRegexPatterns(familyPatterns)\
    .setPrefixKeywords(familyKeywords)\
    .setSuffixKeywords(familyKeywords)\
    .setCaseSensitive(False)\
    .setAssertion("family")

flattener = medical.Flattener()\
    .setInputCols("assertionFamily")\
    .setExplodeSelectedFields({"assertionFamily":["metadata.ner_chunk as ner_chunk",
                                            "begin as begin",
                                            "end as end",
                                            "metadata.ner_label as ner_label",
                                            "result"]})

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextualAssertionFamily,
        flattener

    ])


empty_data = spark.createDataFrame([[""]]).toDF("text")


model = pipeline.fit(empty_data)
result = model.transform(data)
result.show(truncate=False)

+-------------------+-----+---+---------+----------------------+
|ner_chunk          |begin|end|ner_label|assertionFamily_result|
+-------------------+-----+---+---------+----------------------+
|diabetes           |32   |39 |PROBLEM  |family                |
|hypertension       |73   |84 |PROBLEM  |family                |
|heart disease      |125  |137|PROBLEM  |family                |
|asthma             |179  |184|PROBLEM  |family                |
|cancer             |219  |224|PROBLEM  |family                |
|substance abuse    |270  |284|PROBLEM  |family                |
|autoimmune diseases|321  |339|PROBLEM  |family                |
+-------------------+-----+---+---------+----------------------+



## Pretrained Models

| Model Name                                                            |      Description            |
|-----------------------------------------------------------------------|-----------------------------|
| [`contextual_assertion_someone_else`](https://nlp.johnsnowlabs.com/2024/06/26/contextual_assertion_someone_else_en.html) |  Identifies contextual cues within text data to detect `someone else` assertions |
| [`contextual_assertion_absent`](https://nlp.johnsnowlabs.com/2024/07/03/contextual_assertion_absent_en.html) |  Identifies contextual cues within text data to detect `absent` assertions |
| [`contextual_assertion_past`](https://nlp.johnsnowlabs.com/2024/07/04/contextual_assertion_past_en.html) |  Identifies contextual cues within text data to detect `past` assertions |

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel \
    .pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

contextual_assertion_someoneElse = medical.ContextualAssertion\
     .pretrained("contextual_assertion_someone_else" ,"en" ,"clinical/models")\
     .setInputCols("sentence", "token", "ner_chunk")\
     .setOutputCol("assertionSomeoneElse")

flattener = medical.Flattener()\
      .setInputCols("assertionSomeoneElse")\
      .setExplodeSelectedFields({"assertionSomeoneElse": ["result as result",
                                                          "begin as begin",
                                                          "end as end",
                                                          "metadata.ner_chunk as ner_chunk",
                                                          "metadata.ner_label as ner_label"]
                               })

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    contextual_assertion_someoneElse,
    flattener

])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)
text = """Patient has a family history of diabetes. Father diagnosed with heart failure last year. Sister and brother both have asthma.
          Grandfather had cancer in his late 70s. No known family history of substance abuse. Family history of autoimmune diseases is also noted."""

data = spark.createDataFrame([[text]]).toDF('text')

result = model.transform(data)
result.show(truncate=False)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
contextual_assertion_someone_else download started this may take some time.
Approximate size to download 1.5 KB
[OK!]
+----------------------------+-----+---+-------------------+---------+
|result                      |begin|end|ner_chunk          |ner_label|
+----------------------------+-----+---+-------------------+---------+
|associated_with_someone_else|32   |39 |diabetes           |PROBLEM  |
|associated_with_someone_else|64   |76 |heart failure      |PROBLEM  |
|associated_with_someone_else|118  |123|asthma             |PROBLEM  |
|associated_with_someone_else|152  |157|cancer             |PROBLEM  |
|associated_with_someone_else|203  |217|substance abuse    |PROBLEM  |
|associated_with_someone_else|238  |256|autoimmune diseases|PROBLEM  |
+----------------------------+-----+---+-

## Contextual Assertion Benchmarks for Enhanced Clinical Text Annotation

The following benchmark compares Contextual Assertion, used alone or in combination with AssertionDL models, for annotating clinical texts across several assertion-focused pipelines.

- `Dataset`: 253 clinical texts from an in-house dataset
  - train set count: 201
  - test set count: 52

**AssertionDL Pipeline:**

```python
clinical_assertion_pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        assertionDL  # Pretrained AssertionDL Model
    ])

```

*AssertionDL Results:*
```bash
       label  precision    recall  f1-score   support
      Absent       0.80      0.76      0.78        21
        Past       0.63      0.67      0.65        18
     Present       0.54      0.54      0.54        13
    accuracy        -         -        0.67        52
   macro-avg       0.66      0.66      0.66        52
weighted-avg       0.68      0.67      0.67        52

```
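
For readers who want to check how the summary rows relate to the per-label rows, the sketch below recomputes the macro and weighted averages from the AssertionDL table. The values are copied from that table, and `macro_avg` and `weighted_avg` are illustrative helpers. Because the inputs are already rounded to two decimals, a recomputed value can differ from a reported one in the last digit.

```python
# label: (precision, recall, f1, support), copied from the AssertionDL results
rows = {
    "Absent": (0.80, 0.76, 0.78, 21),
    "Past": (0.63, 0.67, 0.65, 18),
    "Present": (0.54, 0.54, 0.54, 13),
}

def macro_avg(rows, idx):
    """Unweighted mean of one metric across labels."""
    return sum(r[idx] for r in rows.values()) / len(rows)

def weighted_avg(rows, idx):
    """Mean of one metric across labels, weighted by support."""
    total = sum(r[3] for r in rows.values())
    return sum(r[idx] * r[3] for r in rows.values()) / total

for name, idx in [("precision", 0), ("recall", 1)]:
    print(name, round(macro_avg(rows, idx), 2), round(weighted_avg(rows, idx), 2))
```

Macro averaging treats each label equally, while weighted averaging lets the larger Absent and Past groups dominate, which is why the two rows can diverge when a small label scores poorly.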

**Contextual Assertion Pipeline:**

```python
clinical_assertion_pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextual_assertion_absent, # Rule based Contextual Assertion
        contextual_assertion_past,   # Rule based Contextual Assertion
        assertionMerger
    ])

```

*Contextual Assertion Results:*
```bash
       label  precision    recall  f1-score   support
      Absent       1.00      0.76      0.86        21
        Past       0.86      1.00      0.92        18
    accuracy        -         -        0.87        39
   macro-avg       0.62      0.59      0.60        39
weighted-avg       0.93      0.87      0.89        39
```



**Contextual Assertion and AssertionDL Pipeline:**

```python
clinical_assertion_pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        contextual_assertion_absent, # Rule based Contextual Assertion
        contextual_assertion_past,   # Rule based Contextual Assertion
        assertionDL,                 # Pretrained AssertionDL Model
        assertionMerger
    ])

```

*Contextual Assertion and AssertionDL Results:*
```bash
       label  precision    recall  f1-score   support
      Absent       0.90      0.86      0.88        21
        Past       0.62      1.00      0.77        18
     Present       1.00      0.23      0.38        13
    accuracy        -         -        0.75        52
   macro-avg       0.84      0.70      0.67        52
weighted-avg       0.83      0.75      0.71        52

```


On this test set, combining Contextual Assertion with the AssertionDL model raises accuracy from 0.67 to 0.75 and weighted F1 from 0.67 to 0.71 compared with AssertionDL alone. The gains come from the Absent (F1 0.78 to 0.88) and Past (F1 0.65 to 0.77) labels, while Present recall drops from 0.54 to 0.23.


- `Dataset`: in-house `jsl_augmented` dataset


F1 scores:

|Assertion Label|Contextual Assertion|AssertionDL|
|-|-|-|
|Absent       |0.82|0.90|
|Family       |0.63|0.73|
|Hypothetical |0.51|0.69|
|Past         |0.73|0.77|
|Planned      |0.57|0.62|
|Possible     |0.49|0.74|
|SomeoneElse  |0.79|0.84|