![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/AssertionMerger.ipynb)

# **AssertionMerger**


This notebook covers the usage of `AssertionMerger`, an annotator that merges assertion annotation columns produced by two or more assertion annotators into a single output column.

**📖 Learning Objectives:**

- Merging two or more annotation results of the same type in a Spark NLP pipeline

**🔗 Helpful Links:**

- Documentation : [AssertionMerger](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#assertionmerger)  
- Python Docs : [AssertionMerger](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/assertion/assertion_merger/index.html)  
- Scala Docs : [AssertionMerger](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/assertion/merger/AssertionMerger.html)  




## ⚙️ Parameters

- **mergeOverlapping (Bool)**: Whether to merge overlapping matched assertions.  
- **applyFilterBeforeMerge (Bool)**: Whether to apply filtering before merging.  
- **assertionsConfidence (dict[str, float])**: Pairs `(assertion, confidenceThreshold)` to filter assertions which have confidence lower than the confidence threshold.  
- **orderingFeatures (list[str])**: Specifies the ordering features to use for overlapping entities. Possible values include: `begin`, `end`, `length`, `source`, `confidence`. Default: `['begin', 'length', 'source']`  
- **selectionStrategy (str)**: Determines the strategy for selecting annotations. Options: `Sequential`, `DiverseLonger`. Default: `Sequential`.  
- **defaultConfidence (float)**: Confidence value to use when not available. Default: `0`.  
- **assertionSourcePrecedence (str)**: Comma-separated list of assertion sources for prioritizing overlapping annotations when using `source`.  
- **sortByBegin (Bool)**: Whether to sort annotations by `begin` at the end of merge/filter process. Default: `False`.  
- **blackList (list[str])**: Entities to ignore.  
- **whiteList (list[str])**: Entities to process (others ignored). Do not include IOB prefix.  
- **caseSensitive (Bool)**: Case sensitivity for black/white list. Default: `True`.  
- **majorityVoting (Bool)**: Whether to use majority voting to resolve conflicts. Default: `False`.  
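
To make the role of `defaultConfidence` concrete, here is a minimal plain-Python sketch of a confidence lookup with a fallback. It assumes that an annotation's metadata is a dictionary of strings with an optional `confidence` key, which is how the helper function in this notebook reads it. The function name `confidence_of` is hypothetical, and this is not the annotator's actual code.

```python
def confidence_of(metadata, default_confidence=0.0):
    """Return the annotation confidence as a float, or the default when it is missing or unparseable."""
    raw = metadata.get("confidence")
    if raw is None:
        return default_confidence
    try:
        return float(raw)
    except ValueError:
        return default_confidence


# Metadata values are strings; the second annotation carries no confidence at all.
with_score = {"assertion_source": "assertion_jsl", "confidence": "0.9999"}
without_score = {"assertion_source": "assertion_past"}

print(confidence_of(with_score, default_confidence=0.5))     # 0.9999
print(confidence_of(without_score, default_confidence=0.5))  # 0.5
```

A fallback like this matters once `confidence` is one of the ordering features, because every annotation then needs a comparable score.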


## **📜 Background**

- `AssertionMerger` merges assertion columns produced by different assertion annotators, such as **AssertionDL** and **AssertionLogReg**.

- Its parameters control how the assertion annotations are filtered, prioritized, and merged.

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [3]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_10417 (1).json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-6.1.1-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-6.1.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-6.1.1.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-6.1.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_10417 (1).json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-6.1.0-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==6.1.0 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()
spark

# Build an Assertion Pipeline

### Helper Function

In [5]:
from pyspark.sql import functions as F

def extract_assertions(results_df):
    """
    Explodes assertion_merger arrays and prints a flattened table with useful fields.

    Parameters
    ----------
    results_df : pyspark.sql.DataFrame
        DataFrame containing 'idx' and 'assertion_merger' columns.

    Returns
    -------
    None
        The flattened rows are printed with columns:
        idx, ner_chunk, begin, end, ner_label, assertion,
        assertion_source, confidence
    """
    # Read from the DataFrame passed in, not from the notebook-global `results`.
    (
        results_df.select("idx", F.explode(F.arrays_zip(results_df.assertion_merger.metadata,
                                                        results_df.assertion_merger.begin,
                                                        results_df.assertion_merger.end,
                                                        results_df.assertion_merger.result)).alias("cols"))
        # cols['0'] is the metadata map; cols['1'..'3'] are begin, end and result.
        .select("idx", F.expr("cols['0']['ner_chunk']").alias("ner_chunk"),
                F.expr("cols['1']").alias("begin"),
                F.expr("cols['2']").alias("end"),
                F.expr("cols['0']['ner_label']").alias("ner_label"),
                F.expr("cols['3']").alias("assertion"),
                F.expr("cols['0']['assertion_source']").alias("assertion_source"),
                F.expr("cols['0']['confidence']").alias("confidence"),
                )
        .sort("idx", "begin")
        .show(truncate=False)
    )
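
The helper zips four parallel arrays (`metadata`, `begin`, `end`, `result`) and explodes them into one row per annotation. The plain-Python sketch below mirrors that reshaping without Spark, using two rows taken from the first merged output in this notebook. It is only an analogy for what `arrays_zip` plus `explode` do here, not a replacement for the helper.

```python
# One document's assertion_merger column, reduced to the four parallel arrays.
metadata = [
    {"ner_chunk": "headache", "ner_label": "Symptom",
     "assertion_source": "assertion_jsl", "confidence": "0.9999"},
    {"ner_chunk": "anxious", "ner_label": "Symptom",
     "assertion_source": "assertion_jsl", "confidence": "1.0"},
]
begin = [14, 57]
end = [21, 63]
result = ["Past", "Possible"]

# zip() pairs the i-th element of every array, like arrays_zip;
# the list comprehension yields one dict per annotation, like explode.
rows = [
    {
        "ner_chunk": m["ner_chunk"],
        "begin": b,
        "end": e,
        "ner_label": m["ner_label"],
        "assertion": r,
        "assertion_source": m["assertion_source"],
        "confidence": m["confidence"],
    }
    for m, b, e, r in zip(metadata, begin, end, result)
]
rows.sort(key=lambda row: row["begin"])

for row in rows:
    print(row)
```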

## Assertion Pipeline

In [7]:
from pyspark.sql.types import StringType

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_jsl = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_jsl")

ner_jsl_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_jsl"]) \
    .setOutputCol("ner_jsl_chunk")\
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])

assertion_jsl = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_jsl_chunk", "embeddings"]) \
    .setOutputCol("assertion_jsl")\
    .setEntityAssertionCaseSensitive(False)

ner_clinical = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_clinical")

ner_clinical_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_clinical"]) \
    .setOutputCol("ner_clinical_chunk")\
    .setWhiteList(["PROBLEM"])

assertion_classifier = medical.BertForAssertionClassification.pretrained("assertion_bert_classification_jsl", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_clinical_chunk"])\
    .setOutputCol("assertion_class")

contextual_assertion = medical.ContextualAssertion.pretrained("contextual_assertion_past","en","clinical/models")\
    .setInputCols("sentence", "token", "ner_clinical_chunk") \
    .setOutputCol("assertion_past")

assertion_merger = medical.AssertionMerger() \
    .setInputCols("assertion_jsl", "assertion_class", "assertion_past") \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("sequential") \
    .setCaseSensitive(False) \
    .setOrderingFeatures(["source", "confidence", "begin"]) \
    .setDefaultConfidence(0.50)

pipeline = nlp.Pipeline( stages =[document_assembler,
                              sentence_detector,
                              tokenizer,
                              word_embeddings,
                              ner_jsl,
                              ner_jsl_converter,
                              assertion_jsl,
                              ner_clinical,
                              ner_clinical_converter,
                              assertion_classifier,
                              contextual_assertion,
                              assertion_merger])


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
assertion_jsl_augmented download started this may take some time.
Approximate size to download 6.2 MB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_bert_classification_jsl download started this may take some time.
Approximate size to download 387.4 MB
[OK!]
contextual_assertion_past download started this may take some time.
Approximate size to download 1.5 KB
[OK!]


In [8]:
text =  """Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She got antidepressant. We prescribed sleeping pills for her current insomnia."""

data = spark.createDataFrame([text], StringType()).toDF("text")

data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())

results = pipeline.fit(data).transform(data)

In [9]:
extract_assertions(results)

+---+--------------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk           |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------------+-----+---+---------+---------+----------------+----------+
|0  |headache            |14   |21 |Symptom  |Past     |assertion_jsl   |0.9999    |
|0  |anxious             |57   |63 |Symptom  |Possible |assertion_jsl   |1.0       |
|0  |alopecia            |89   |96 |Symptom  |Absent   |assertion_jsl   |1.0       |
|0  |pain                |116  |119|Symptom  |Absent   |assertion_jsl   |1.0       |
|0  |paralyzed           |136  |144|Symptom  |Family   |assertion_jsl   |0.9995    |
|0  |stressor            |158  |165|Symptom  |Family   |assertion_jsl   |1.0       |
|0  |her current insomnia|233  |252|PROBLEM  |Present  |assertion_class |0.9993082 |
+---+--------------------+-----+---+---------+---------+----------------+----------+



## setMergeOverlapping(False)

In [12]:
assertion_merger = medical.AssertionMerger() \
    .setInputCols("assertion_jsl", "assertion_class", "assertion_past") \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(False) \
    .setSelectionStrategy("sequential") \
    .setCaseSensitive(False) \
    .setOrderingFeatures(["source", "confidence", "begin"])

pipeline = nlp.Pipeline( stages =[document_assembler,
                              sentence_detector,
                              tokenizer,
                              word_embeddings,
                              ner_jsl,
                              ner_jsl_converter,
                              assertion_jsl,
                              ner_clinical,
                              ner_clinical_converter,
                              assertion_classifier,
                              contextual_assertion,
                              assertion_merger])

In [13]:
data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())
results = pipeline.fit(data).transform(data)

In [14]:
extract_assertions(results)

+---+--------------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk           |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------------+-----+---+---------+---------+----------------+----------+
|0  |a headache          |12   |21 |PROBLEM  |Present  |assertion_class |0.99922687|
|0  |a headache          |12   |21 |PROBLEM  |Past     |assertion_past  |0.8025    |
|0  |headache            |14   |21 |Symptom  |Past     |assertion_jsl   |0.9999    |
|0  |anxious             |57   |63 |Symptom  |Possible |assertion_jsl   |1.0       |
|0  |anxious             |57   |63 |PROBLEM  |Possible |assertion_class |0.9912843 |
|0  |anxious             |57   |63 |PROBLEM  |Past     |assertion_past  |0.3465    |
|0  |alopecia            |89   |96 |Symptom  |Absent   |assertion_jsl   |1.0       |
|0  |alopecia            |89   |96 |PROBLEM  |Absent   |assertion_class |0.9978027 |
|0  |alopecia            |89   |96 |PROBLEM  |Past     |assertion

With `setMergeOverlapping(False)`, the output keeps every annotation from every assertion model, so the same chunk can appear once per source (for example, `anxious` is listed three times).
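
To see which rows compete with each other once merging is switched on, the sketch below groups annotations whose character spans overlap. It assumes inclusive `begin`/`end` offsets, as in the tables above; the function name `group_overlapping` is hypothetical, and this is an illustration rather than the annotator's algorithm.

```python
def group_overlapping(annotations):
    """Group annotations whose [begin, end] spans overlap (both ends inclusive)."""
    groups = []
    current, current_end = [], -1
    for ann in sorted(annotations, key=lambda a: (a["begin"], a["end"])):
        if current and ann["begin"] <= current_end:
            # Starts inside the running group: same cluster.
            current.append(ann)
            current_end = max(current_end, ann["end"])
        else:
            if current:
                groups.append(current)
            current, current_end = [ann], ann["end"]
    if current:
        groups.append(current)
    return groups


# Rows copied from the unmerged output above.
annotations = [
    {"ner_chunk": "a headache", "begin": 12, "end": 21, "assertion_source": "assertion_class"},
    {"ner_chunk": "a headache", "begin": 12, "end": 21, "assertion_source": "assertion_past"},
    {"ner_chunk": "headache", "begin": 14, "end": 21, "assertion_source": "assertion_jsl"},
    {"ner_chunk": "anxious", "begin": 57, "end": 63, "assertion_source": "assertion_jsl"},
    {"ner_chunk": "anxious", "begin": 57, "end": 63, "assertion_source": "assertion_class"},
    {"ner_chunk": "anxious", "begin": 57, "end": 63, "assertion_source": "assertion_past"},
]

groups = group_overlapping(annotations)
for group in groups:
    print([(a["ner_chunk"], a["assertion_source"]) for a in group])
```

Each printed group holds the candidates for one position in the text. In the merged runs in this notebook, only one row per such group appears in the output.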



## setAssertionSourcePrecedence()

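`setAssertionSourcePrecedence()` sets which model's result is preferred when annotations overlap. Sources are listed in priority order, and the setting takes effect when `source` is among the ordering features.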

In [18]:
assertion_merger = medical.AssertionMerger() \
    .setInputCols("assertion_jsl", "assertion_class", "assertion_past") \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("sequential") \
    .setCaseSensitive(False) \
    .setAssertionSourcePrecedence("assertion_class, assertion_jsl, assertion_past") \
    .setOrderingFeatures(["source", "confidence", "begin"])

pipeline = nlp.Pipeline( stages =[document_assembler,
                              sentence_detector,
                              tokenizer,
                              word_embeddings,
                              ner_jsl,
                              ner_jsl_converter,
                              assertion_jsl,
                              ner_clinical,
                              ner_clinical_converter,
                              assertion_classifier,
                              contextual_assertion,
                              assertion_merger])

In [19]:
data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())
results = pipeline.fit(data).transform(data)

In [20]:
extract_assertions(results)

+---+--------------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk           |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------------+-----+---+---------+---------+----------------+----------+
|0  |a headache          |12   |21 |PROBLEM  |Present  |assertion_class |0.99922687|
|0  |anxious             |57   |63 |PROBLEM  |Possible |assertion_class |0.9912843 |
|0  |alopecia            |89   |96 |PROBLEM  |Absent   |assertion_class |0.9978027 |
|0  |pain                |116  |119|PROBLEM  |Absent   |assertion_class |0.99867624|
|0  |paralyzed           |136  |144|Symptom  |Family   |assertion_jsl   |0.9995    |
|0  |stressor            |158  |165|Symptom  |Family   |assertion_jsl   |1.0       |
|0  |her current insomnia|233  |252|PROBLEM  |Present  |assertion_class |0.9993082 |
+---+--------------------+-----+---+---------+---------+----------------+----------+
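
The effect of the precedence string can be sketched in plain Python: split the comma-separated list into ranks, then keep the overlapping candidate whose source ranks first. The function names and the rule that unlisted sources sort last are assumptions of this sketch, not documented behavior of the annotator.

```python
def parse_precedence(precedence):
    """Turn 'a, b, c' into {'a': 0, 'b': 1, 'c': 2}; a lower rank means higher priority."""
    names = [name.strip() for name in precedence.split(",") if name.strip()]
    return {name: rank for rank, name in enumerate(names)}


def pick_by_source(candidates, precedence):
    """Return the candidate whose assertion_source comes first in the precedence string."""
    rank = parse_precedence(precedence)
    # Assumption of this sketch: sources missing from the string sort last.
    return min(candidates, key=lambda a: rank.get(a["assertion_source"], len(rank)))


# The three overlapping candidates around "headache" from the unmerged run.
candidates = [
    {"ner_chunk": "headache", "assertion": "Past", "assertion_source": "assertion_jsl"},
    {"ner_chunk": "a headache", "assertion": "Present", "assertion_source": "assertion_class"},
    {"ner_chunk": "a headache", "assertion": "Past", "assertion_source": "assertion_past"},
]

winner = pick_by_source(candidates, "assertion_class, assertion_jsl, assertion_past")
print(winner["ner_chunk"], winner["assertion"], winner["assertion_source"])
# a headache Present assertion_class
```

This matches the first row of the table above, where the `assertion_class` result for `a headache` replaces the `assertion_jsl` result for `headache`.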



## setWhiteList()

In [27]:
assertion_merger = medical.AssertionMerger() \
    .setInputCols("assertion_jsl", "assertion_class", "assertion_past") \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("sequential") \
    .setCaseSensitive(False) \
    .setOrderingFeatures(["source", "confidence", "begin"])\
    .setWhiteList(["Past", "Family"])

pipeline = nlp.Pipeline( stages =[document_assembler,
                              sentence_detector,
                              tokenizer,
                              word_embeddings,
                              ner_jsl,
                              ner_jsl_converter,
                              assertion_jsl,
                              ner_clinical,
                              ner_clinical_converter,
                              assertion_classifier,
                              contextual_assertion,
                              assertion_merger])

In [28]:
data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())

results = pipeline.fit(data).transform(data)

In [29]:
extract_assertions(results)

+---+---------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk|begin|end|ner_label|assertion|assertion_source|confidence|
+---+---------+-----+---+---------+---------+----------------+----------+
|0  |headache |14   |21 |Symptom  |Past     |assertion_jsl   |0.9999    |
|0  |paralyzed|136  |144|Symptom  |Family   |assertion_jsl   |0.9995    |
|0  |stressor |158  |165|Symptom  |Family   |assertion_jsl   |1.0       |
+---+---------+-----+---+---------+---------+----------------+----------+



With the `setWhiteList` applied, entities linked to the `Past` and `Family` assertion labels are included in the results, while others are excluded.

## setBlackList()

In [30]:
assertion_merger = medical.AssertionMerger() \
    .setInputCols("assertion_jsl", "assertion_class", "assertion_past") \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("sequential") \
    .setAssertionSourcePrecedence("assertion_class, assertion_jsl, assertion_past") \
    .setCaseSensitive(False) \
    .setOrderingFeatures(["source", "confidence", "begin"])\
    .setBlackList(["Past", "Family"])

pipeline = nlp.Pipeline( stages =[document_assembler,
                              sentence_detector,
                              tokenizer,
                              word_embeddings,
                              ner_jsl,
                              ner_jsl_converter,
                              assertion_jsl,
                              ner_clinical,
                              ner_clinical_converter,
                              assertion_classifier,
                              contextual_assertion,
                              assertion_merger])

In [31]:
data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())
results = pipeline.fit(data).transform(data)

In [32]:
extract_assertions(results)

+---+--------------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk           |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------------+-----+---+---------+---------+----------------+----------+
|0  |a headache          |12   |21 |PROBLEM  |Present  |assertion_class |0.99922687|
|0  |anxious             |57   |63 |PROBLEM  |Possible |assertion_class |0.9912843 |
|0  |alopecia            |89   |96 |PROBLEM  |Absent   |assertion_class |0.9978027 |
|0  |pain                |116  |119|PROBLEM  |Absent   |assertion_class |0.99867624|
|0  |her current insomnia|233  |252|PROBLEM  |Present  |assertion_class |0.9993082 |
+---+--------------------+-----+---+---------+---------+----------------+----------+




As you can see, rows with the `Past` and `Family` assertion labels are filtered out.
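
White and black lists are membership tests on the assertion label, with `caseSensitive` deciding whether `past` matches `Past`. The sketch below applies both to rows taken from the source-precedence run. For simplicity it filters rows that are already merged; in the real annotator the order of filtering and merging is governed by `applyFilterBeforeMerge`. The function name `filter_labels` is hypothetical.

```python
def filter_labels(annotations, white_list=None, black_list=None, case_sensitive=True):
    """Keep annotations whose label is in white_list (if given) and not in black_list."""
    norm = (lambda s: s) if case_sensitive else str.lower
    white = {norm(x) for x in white_list} if white_list else None
    black = {norm(x) for x in black_list} if black_list else set()
    kept = []
    for ann in annotations:
        label = norm(ann["assertion"])
        if white is not None and label not in white:
            continue
        if label in black:
            continue
        kept.append(ann)
    return kept


# (chunk, assertion) pairs from the source-precedence output.
annotations = [
    {"ner_chunk": "a headache", "assertion": "Present"},
    {"ner_chunk": "anxious", "assertion": "Possible"},
    {"ner_chunk": "alopecia", "assertion": "Absent"},
    {"ner_chunk": "pain", "assertion": "Absent"},
    {"ner_chunk": "paralyzed", "assertion": "Family"},
    {"ner_chunk": "stressor", "assertion": "Family"},
    {"ner_chunk": "her current insomnia", "assertion": "Present"},
]

after_black = filter_labels(annotations, black_list=["past", "family"], case_sensitive=False)
after_white = filter_labels(annotations, white_list=["past", "family"], case_sensitive=False)

print([a["ner_chunk"] for a in after_black])
print([a["ner_chunk"] for a in after_white])
```

With the blacklist, the five remaining chunks are the same ones shown in the table above. With `case_sensitive=True`, the lowercase entries would match nothing.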

## setAssertionsConfidence()

In [33]:
assertion_merger = medical.AssertionMerger() \
    .setInputCols("assertion_jsl", "assertion_class", "assertion_past") \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("sequential") \
    .setAssertionSourcePrecedence("assertion_class, assertion_jsl, assertion_past") \
    .setCaseSensitive(False) \
    .setOrderingFeatures(["source", "confidence", "begin"])\
    .setAssertionsConfidence({"Present": 0.95})

pipeline = nlp.Pipeline( stages =[document_assembler,
                              sentence_detector,
                              tokenizer,
                              word_embeddings,
                              ner_jsl,
                              ner_jsl_converter,
                              assertion_jsl,
                              ner_clinical,
                              ner_clinical_converter,
                              assertion_classifier,
                              contextual_assertion,
                              assertion_merger])

In [34]:
data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())
results = pipeline.fit(data).transform(data)

In [35]:
extract_assertions(results)

+---+--------------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk           |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------------+-----+---+---------+---------+----------------+----------+
|0  |a headache          |12   |21 |PROBLEM  |Present  |assertion_class |0.99922687|
|0  |anxious             |57   |63 |PROBLEM  |Possible |assertion_class |0.9912843 |
|0  |alopecia            |89   |96 |PROBLEM  |Absent   |assertion_class |0.9978027 |
|0  |pain                |116  |119|PROBLEM  |Absent   |assertion_class |0.99867624|
|0  |paralyzed           |136  |144|Symptom  |Family   |assertion_jsl   |0.9995    |
|0  |stressor            |158  |165|Symptom  |Family   |assertion_jsl   |1.0       |
|0  |her current insomnia|233  |252|PROBLEM  |Present  |assertion_class |0.9993082 |
+---+--------------------+-----+---+---------+---------+----------------+----------+



This means that a `Present` assertion is kept only if its confidence score is 0.95 or higher; `Present` predictions with lower confidence are filtered out. In this example both `Present` rows score above 0.99, so the output is the same as without the threshold.
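
Because every row in this example passes the threshold, the filter is easier to see on a small plain-Python sketch. Three rows below are copied from the table above; the two `made-up chunk` rows are hypothetical and exist only to show a rejection and a label with no threshold. The function name `apply_thresholds` is also hypothetical, and the code is an illustration rather than the annotator's implementation.

```python
def apply_thresholds(annotations, thresholds):
    """Keep an annotation unless its label has a threshold and its confidence is below it."""
    kept = []
    for ann in annotations:
        threshold = thresholds.get(ann["assertion"])
        if threshold is None or ann["confidence"] >= threshold:
            kept.append(ann)
    return kept


annotations = [
    {"ner_chunk": "a headache", "assertion": "Present", "confidence": 0.99922687},
    {"ner_chunk": "anxious", "assertion": "Possible", "confidence": 0.9912843},
    {"ner_chunk": "her current insomnia", "assertion": "Present", "confidence": 0.9993082},
    # Hypothetical rows, not from the notebook output.
    {"ner_chunk": "made-up chunk A", "assertion": "Present", "confidence": 0.80},
    {"ner_chunk": "made-up chunk B", "assertion": "Possible", "confidence": 0.40},
]

kept = apply_thresholds(annotations, {"Present": 0.95})
print([a["ner_chunk"] for a in kept])
# ['a headache', 'anxious', 'her current insomnia', 'made-up chunk B']
```

Only the low-confidence `Present` row is dropped. The low-confidence `Possible` row stays, because no threshold was set for that label.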