![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/academic/Assertion_Detection_Text2Story2025/Combined_Assertion_Pipeline.ipynb)



#Colab Setup

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1  spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install --upgrade -q spark-nlp-display

In [3]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.common import *
from sparknlp.training import *
from sparknlp.annotator import *

from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *

from pyspark.ml import Pipeline
from pyspark.sql.types import StringType
import pyspark.sql.types as T
import pyspark.sql.functions as F

import functools
import numpy as np
import pandas as pd
from scipy import spatial

params = {
    "spark.driver.memory":"32G",
    "spark.driver.maxResultSize":"5G",
    "spark.kryoserializer.buffer.max":"2000M",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params, #gpu=True
                           )
print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.5.3
Spark NLP_JSL Version : 5.5.3


#Data Preparation

 This benchmark notebook was developed using the i2b2 dataset. Due to privacy and data use restrictions, we are unable to share the i2b2 dataset itself.

In [5]:
import pandas as pd

test_df = pd.read_csv("/content/i2b2_test_official_dataset.csv")
test_df["length"] = test_df["text"].apply(lambda x: len(x))
test_df['label'] = test_df['label'].replace('family', 'associated_with_someone_else')

test_df.label.value_counts(dropna=False)

label,count
present,8622
absent,2594
possible,652
hypothetical,445
conditional,148
associated_with_someone_else,131


#Pipeline Initial Components

This pipeline segment performs basic text preprocessing and chunk-to-token alignment. First, DocumentAssembler converts raw text into a format suitable for NLP processing. Then, Tokenizer splits the text into individual tokens. Finally, AssertionChunkConverter maps pre-annotated entity chunks (defined by their character-based begin and end positions) to their corresponding token indices, producing a new column ner_chunk that links chunks with token-level boundaries.
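
The chunk-to-token alignment described above can be pictured with a small plain-Python sketch. It is only an illustration, not the AssertionChunkConverter implementation: the helper name `char_span_to_token_span` is hypothetical, tokens are split on whitespace rather than by the Spark NLP Tokenizer, and the end offset is assumed to be exclusive.

```python
import re

def char_span_to_token_span(text, begin, end):
    """Map a character span [begin, end) to inclusive token indices.

    Assumption: whitespace tokenization and an exclusive end offset.
    Returns None when no token overlaps the span.
    """
    # Character boundaries of every whitespace-delimited token
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    # Keep the indices of tokens that overlap the requested span
    covered = [i for i, (s, e) in enumerate(tokens) if s < end and e > begin]
    if not covered:
        return None
    return covered[0], covered[-1]

text = "Patient denies chest pain today"
begin = text.index("chest pain")
end = begin + len("chest pain")
token_span = char_span_to_token_span(text, begin, end)
print(token_span)  # first and last token index covered by the chunk
```

In the pipeline below, the equivalent information ends up in the `token_begin` and `token_end` output columns.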

In [6]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

converter = AssertionChunkConverter() \
    .setInputCols("token")\
    .setChunkTextCol("chunk")\
    .setChunkBeginCol("begin")\
    .setChunkEndCol("end")\
    .setOutputTokenBeginCol("token_begin")\
    .setOutputTokenEndCol("token_end")\
    .setOutputCol("ner_chunk")


#Assertion DL




AssertionDL is a deep-learning-based approach for assigning an assertion status to entities extracted from text. AssertionDLModel requires DOCUMENT, CHUNK, and WORD_EMBEDDINGS annotations as inputs, which can be obtained from, e.g., a DocumentAssembler, NerConverter, and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity.

In [7]:
word_embeddings_100 = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings_100d")

clinical_assertion_100 = AssertionDLModel.pretrained("assertion_dl_healthcare","en","clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings_100d"]) \
    .setOutputCol("assertionDL")\
    .setEntityAssertionCaseSensitive(False)


embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]
assertion_dl_healthcare download started this may take some time.
[OK!]


#FewShot Assertion



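FewShotAssertionClassifierModel performs assertion classification and can run large (LLM-based) few-shot classifiers built on the SetFit approach. In the cell below, FewShotAssertionSentenceConverter prepares an assertion sentence for each chunk, E5Embeddings encodes it, and the classifier predicts the assertion status from that embedding.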
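
To give a feel for classifying from sentence embeddings with only a handful of labeled examples, here is a toy sketch. It is not the SetFit training procedure and not the classifier used by the pretrained model: a nearest-centroid head with cosine similarity stands in for the real classification head, the 3-dimensional vectors are made-up stand-ins for real embeddings, and `fit_centroids` and `predict` are hypothetical helpers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fit_centroids(examples):
    """examples: list of (embedding, label). Returns label -> mean embedding."""
    sums, counts = {}, {}
    for vec, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Pick the label whose centroid is most similar to the query embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Made-up toy embeddings: two labeled examples per class
few_shots = [
    ([0.9, 0.1, 0.0], "present"),
    ([0.8, 0.2, 0.1], "present"),
    ([0.1, 0.9, 0.0], "absent"),
    ([0.0, 0.8, 0.2], "absent"),
]
centroids = fit_centroids(few_shots)
prediction = predict(centroids, [0.2, 0.7, 0.1])
print(prediction)
```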

In [8]:
few_shot_assertion_converter = FewShotAssertionSentenceConverter()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("assertion_sentence")

e5_embeddings = E5Embeddings.pretrained("e5_base_v2_embeddings_medical_assertion_i2b2", "en", "clinical/models")\
    .setInputCols(["assertion_sentence"])\
    .setOutputCol("assertion_embedding")

few_shot_assertion_classifier = FewShotAssertionClassifierModel\
    .pretrained("fewhot_assertion_i2b2_e5_base_v2_i2b2", "en", "clinical/models")\
    .setInputCols(["assertion_embedding"])\
    .setOutputCol("assertion_fewshot")

e5_base_v2_embeddings_medical_assertion_i2b2 download started this may take some time.
Approximate size to download 374.9 MB
[OK!]
fewhot_assertion_i2b2_e5_base_v2_i2b2 download started this may take some time.
[OK!]


#Contextual Assertion

ContextualAssertion is an annotator model for contextual assertion analysis. It identifies contextual cues in text, such as negation and uncertainty, and is used for clinical assertion detection. It annotates text chunks with assertions based on configurable rules, prefix and suffix patterns, and exception patterns.
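
The idea of prefix cues limited by a scope window can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the ContextualAssertion algorithm: `is_negated`, the cue list, and the default five-token window are all hypothetical, and only prefix cues are handled.

```python
def is_negated(tokens, chunk_start, prefix_cues, delimiters, window=5):
    """Return True if a negation cue precedes the chunk within `window` tokens.

    The scan walks backwards from the chunk and stops at the first scope
    delimiter, so a cue on the far side of a delimiter is ignored.
    """
    lo = max(0, chunk_start - window)
    for tok in reversed(tokens[lo:chunk_start]):
        word = tok.lower()
        if word in delimiters:
            return False
        if word in prefix_cues:
            return True
    return False

tokens = "There is some edema but no cyanosis".split()
cues = {"no", "denies", "without"}   # hypothetical prefix patterns
delims = {"and", "but"}              # same delimiters as in the cell below

edema_negated = is_negated(tokens, tokens.index("edema"), cues, delims)
cyanosis_negated = is_negated(tokens, tokens.index("cyanosis"), cues, delims)
print(edema_negated, cyanosis_negated)
```

The pretrained models configured below ship with their own patterns; the sketch only shows why delimiters such as "and" and "but" matter for scoping.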

In [9]:
# Detect the `absent` assertion label
contextual_assertion_absent = ContextualAssertion\
    .pretrained("contextual_assertion_absent", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setAssertion("absent")\
    .setCaseSensitive(False)\
    .setIncludeChunkToScope(True)\
    .setScopeWindowDelimiters(["and", "but"])\
    .setConfidenceCalculationDirection("both")\
    .setOutputCol("ca_absent")

# Detect the `possible` assertion label
contextual_assertion_possible = ContextualAssertion\
    .pretrained("contextual_assertion_possible", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setAssertion("possible")\
    .setCaseSensitive(False)\
    .setIncludeChunkToScope(True)\
    .setScopeWindowDelimiters(["and", "but"])\
    .setConfidenceCalculationDirection("both")\
    .setOutputCol("ca_possible")

# Detect the `conditional` assertion label
contextual_assertion_conditional = ContextualAssertion\
    .pretrained("contextual_assertion_conditional", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setAssertion("conditional")\
    .setCaseSensitive(False)\
    .setIncludeChunkToScope(True)\
    .setScopeWindowDelimiters(["and", "but"])\
    .setConfidenceCalculationDirection("both")\
    .setOutputCol("ca_conditional")

# Detect the `associated_with_someone_else` assertion label
contextual_assertion_associated = ContextualAssertion\
    .pretrained("contextual_assertion_family", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setAssertion("associated_with_someone_else")\
    .setCaseSensitive(False)\
    .setIncludeChunkToScope(True)\
    .setScopeWindowDelimiters(["and","but"])\
    .setConfidenceCalculationDirection("both")\
    .setOutputCol("ca_associated")


contextual_assertion_absent download started this may take some time.
[OK!]
contextual_assertion_possible download started this may take some time.
[OK!]
contextual_assertion_conditional download started this may take some time.
[OK!]
contextual_assertion_family download started this may take some time.
[OK!]


#Merging

In [10]:
# Merging assertion stages
assertion_merger_fewshot = AssertionMerger()\
      .setInputCols("assertion_fewshot")\
      .setOutputCol("assertion_merger_fewshot")\
      .setWhiteList(["absent","hypothetical"])

assertion_merger_dl = AssertionMerger()\
      .setInputCols("assertionDL")\
      .setOutputCol("assertion_merger_dl")\
      .setWhiteList(["associated_with_someone_else","conditional"])

assertion_merger_all = AssertionMerger()\
      .setInputCols("assertionDL","assertion_fewshot","ca_possible")\
      .setOutputCol("assertion_merger_all")\
      .setMergeOverlapping(True)\
      .setMajorityVoting(False)\
      .setOrderingFeatures(["confidence"])\
      .setWhiteList(["present","possible"])\
      .setApplyFilterBeforeMerge(True)

assertion_merger_final = AssertionMerger()\
      .setInputCols("assertion_merger_fewshot","assertion_merger_dl","assertion_merger_all","ca_conditional")\
      .setOutputCol("assertion_merger")\
      .setMergeOverlapping(True)\
      .setMajorityVoting(True)\
      .setOrderingFeatures(["confidence"])
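
The merger stages above filter each source by a whitelist and then resolve disagreements by majority vote or by confidence. As a rough mental model only (this is not the AssertionMerger implementation, and `merge_assertions` is a hypothetical helper), the behavior for a single chunk can be sketched with `(label, confidence)` tuples:

```python
from collections import Counter

def merge_assertions(candidates, whitelist=None, majority_voting=True):
    """Pick one (label, confidence) pair for a chunk from several sources.

    Assumed order of operations: whitelist filter first, then an optional
    majority vote over labels, then the highest confidence wins.
    """
    if whitelist is not None:
        candidates = [c for c in candidates if c[0] in whitelist]
    if not candidates:
        return None
    if majority_voting:
        votes = Counter(label for label, _ in candidates)
        top = max(votes.values())
        # Keep only candidates whose label has the most votes
        candidates = [c for c in candidates if votes[c[0]] == top]
    return max(candidates, key=lambda c: c[1])

# Three hypothetical sources disagreeing about the same chunk
cands = [("present", 0.90), ("absent", 0.95), ("present", 0.80)]
by_vote = merge_assertions(cands)
by_confidence = merge_assertions(cands, majority_voting=False)
print(by_vote, by_confidence)
```

With voting enabled the two `present` votes win; with voting disabled the single most confident prediction wins, which mirrors the difference between `assertion_merger_final` and `assertion_merger_all` in spirit.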

#Pipeline

In [11]:
pipeline = Pipeline(stages=[
            document_assembler,
            tokenizer,
            converter,
            few_shot_assertion_converter,
            e5_embeddings,
            few_shot_assertion_classifier,
            word_embeddings_100,
            clinical_assertion_100,
            assertion_merger_fewshot,
            contextual_assertion_conditional,
            contextual_assertion_possible,
            assertion_merger_dl,
            assertion_merger_all,
            assertion_merger_final
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
spark_test_df = spark.createDataFrame(test_df).repartition(os.cpu_count() * 4)

model = pipeline.fit(empty_data)

result_df = model.transform(spark_test_df)

#Benchmark

In [12]:
predictions = result_df.select("label","assertion_merger.result")
predictions_df = predictions.toPandas()

predictions_df['result'] = predictions_df['result'].apply(lambda x : x[0] if len(x) > 0 else "N/A")

In [13]:
predictions_df['label'] = predictions_df['label'].str.lower()
predictions_df['result'] = predictions_df['result'].str.lower()

In [14]:
from sklearn.metrics import classification_report

print(classification_report(predictions_df['label'], predictions_df['result'], digits=3))

                              precision    recall  f1-score   support

                      absent      0.951     0.956     0.954      2594
associated_with_someone_else      1.000     0.855     0.922       131
                 conditional      0.731     0.385     0.504       148
                hypothetical      0.867     0.861     0.864       445
                         n/a      0.000     0.000     0.000         0
                    possible      0.802     0.741     0.770       652
                     present      0.957     0.970     0.963      8622

                    accuracy                          0.943     12592
                   macro avg      0.758     0.681     0.711     12592
                weighted avg      0.942     0.943     0.942     12592
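
The `n/a` row appears because chunks with an empty merger result are mapped to "N/A" in the cell above. It has zero support, yet it still counts as a seventh class in the macro average. The arithmetic below recomputes the macro averages from the per-class scores in the report, with and without that row; `macro_average` is a hypothetical helper, and excluding the row is shown as one possible reading of the results, not as the notebook's official metric.

```python
# Per-class (precision, recall, f1) copied from the report above
report = {
    "absent": (0.951, 0.956, 0.954),
    "associated_with_someone_else": (1.000, 0.855, 0.922),
    "conditional": (0.731, 0.385, 0.504),
    "hypothetical": (0.867, 0.861, 0.864),
    "n/a": (0.000, 0.000, 0.000),
    "possible": (0.802, 0.741, 0.770),
    "present": (0.957, 0.970, 0.963),
}

def macro_average(rows, exclude=()):
    """Unweighted mean of each metric over the classes that are kept."""
    kept = [scores for label, scores in rows.items() if label not in exclude]
    return tuple(sum(column) / len(kept) for column in zip(*kept))

with_na = macro_average(report)
without_na = macro_average(report, exclude={"n/a"})
print([f"{x:.3f}" for x in with_na])     # matches the macro avg row of the report
print([f"{x:.3f}" for x in without_na])  # roughly 0.885 / 0.795 / 0.83
```

The accuracy and weighted averages are unaffected, since the `n/a` class has no gold examples.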




#Example

In [15]:
text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
IMPRESSION: At this time is refractory anemia, which is transfusion dependent. He is on B12, iron, folic acid, and Procrit. There are no sign or symptom of blood loss and the previous esophagogastroduodenoscopy was negative. His creatinine was 1.
  My impression at this time is that he probably has an underlying myelodysplastic syndrome or bone marrow failure. His creatinine on this hospitalization was up slightly to 1.6 and this may contribute to his anemia.
  At this time, my recommendation for the patient is that he should undergo a bone marrow aspiration.
  I have discussed the procedure in detail which the patient. I have discussed the risks, benefits, and successes of that treatment and usefulness of the bone marrow and predicting his cause of refractory anemia and further therapeutic interventions, which might be beneficial to him.
  He is willing to proceed with the studies I have described to him. We will order an ultrasound of his abdomen because of the possible fullness of the spleen.
  As always, we greatly appreciate being able to participate in the care of your patient. We appreciate the consultation of the patient.
"""

In [16]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Pretrained clinical NER model (ner_jsl)
clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")


ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE"])


pipeline = Pipeline(stages=[
            documentAssembler,
            sentenceDetector,
            tokenizer,
            word_embeddings,
            clinical_ner,
            ner_converter,
            few_shot_assertion_converter,
            e5_embeddings,
            few_shot_assertion_classifier,
            word_embeddings_100,
            clinical_assertion_100,
            assertion_merger_fewshot,
            contextual_assertion_conditional,
            contextual_assertion_possible,
            assertion_merger_dl,
            assertion_merger_all,
            assertion_merger_final
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
light_model = LightPipeline(model)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]


In [17]:
light_result = light_model.fullAnnotate(text)[0]

chunks=[]
entities=[]
status=[]
confidence=[]

for i in light_result['assertion_merger']:

    chunks.append(i.metadata['ner_chunk'])
    entities.append(i.metadata['ner_label'])
    status.append(i.result)
    confidence.append(i.metadata['confidence'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})
df

,chunks,entities,assertion,confidence
0,distress,Symptom,absent,0.95558286
1,arcus senilis,Disease_Syndrome_Disorder,present,0.999
2,jugular venous pressure distention,Symptom,absent,0.9551166
3,adenopathy,Symptom,absent,0.955725
4,tender,Symptom,absent,0.9558004
5,fullness,Symptom,possible,0.82425153
6,edema,Symptom,present,0.9315111
7,cyanosis,VS_Finding,absent,0.95566493
8,clubbing,Symptom,absent,0.9557143
9,anemia,Disease_Syndrome_Disorder,present,0.9549294


In [18]:
from sparknlp_display import AssertionVisualizer

vis = AssertionVisualizer()

vis.display(light_result, 'ner_chunk', 'assertion_merger')