![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.4.BertForAssertionClassification.ipynb)

# 📜BERT for Assertion Classification

`BertForAssertionClassification` extracts the assertion status of an entity by analyzing both the extracted entity and its surrounding context. The annotator uses pre-trained BERT models fine-tuned on biomedical text (e.g., BioBERT) and applies a sequence classification head (a linear layer on the pooled output) to assign one of several assertion labels to each entity chunk.
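
To make the "linear layer on the pooled output" concrete, here is a minimal pure-Python sketch of a classification head. The label list, the two-dimensional `pooled` vector, and the weights are toy values chosen for illustration; they are assumptions, not the parameters of any released model.

```python
import math

# Hypothetical label set and toy weights, for illustration only.
LABELS = ["absent", "present", "possible"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(pooled, weights, bias):
    """Apply a linear layer to the pooled vector, then pick the most likely label."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

pooled = [0.5, -1.0]                               # stand-in for BERT's pooled output
weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one row of weights per label
bias = [0.0, 0.0, 0.0]

label, probs = classify(pooled, weights, bias)
print(label, [round(p, 3) for p in probs])
```

In a real model the pooled vector has hundreds of dimensions and the weights are learned during fine-tuning; only the shape of the computation carries over.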

  **Key features:**
  - Accepts DOCUMENT and CHUNK type inputs and produces ASSERTION type annotations.
  - Emphasizes entity context by marking target entities with special tokens (e.g., [entity]), allowing the model to better focus on them.
  - Utilizes a transformer-based architecture (BERT for Sequence Classification) to achieve accurate assertion status prediction.
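
The entity-marking idea can be sketched in a few lines of plain Python. The helper `mark_entity` is hypothetical, and the exact marker format the annotator applies internally is an assumption based on the `[entity]` example above.

```python
def mark_entity(sentence, begin, end, marker="[entity]"):
    """Wrap sentence[begin:end + 1] in marker tokens (end is inclusive)."""
    if not (0 <= begin <= end < len(sentence)):
        raise ValueError("chunk offsets fall outside the sentence")
    chunk = sentence[begin:end + 1]
    return f"{sentence[:begin]}{marker} {chunk} {marker}{sentence[end + 1:]}"

sentence = "He is an elderly gentleman in no acute distress."
chunk = "acute distress"
begin = sentence.find(chunk)
end = begin + len(chunk) - 1  # inclusive end offset, as in Spark NLP annotations

print(mark_entity(sentence, begin, end))
```

Marking the span lets a single sentence be classified once per entity, so two chunks in the same sentence can receive different assertion labels.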

## **🎬 Colab Setup**

In [None]:
import json
import os

from google.colab import files

# Upload the license file only if it is not already in the working directory
if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
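
Before installing anything, it can help to confirm that the license file holds the keys this notebook reads later (`SECRET`, `PUBLIC_VERSION`, `JSL_VERSION`). The helper below is a hypothetical sketch, and the sample values are placeholders rather than real credentials.

```python
import json

# Keys this notebook reads later; a real license file may contain more.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not license_keys.get(key)]

# Hypothetical file contents used only to exercise the helper.
sample = json.loads('{"SECRET": "xxx", "PUBLIC_VERSION": "6.0.0"}')
print(missing_license_keys(sample))
```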

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.5.0 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [4]:
import os
import json
import pandas as pd

import sparknlp
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

from sparknlp_jsl.pipeline_tracer import PipelineTracer
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser

from zipfile import ZipFile
from io import BytesIO

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"52G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 6.0.0
Spark NLP_JSL Version : 6.0.0


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`

- Output: `ASSERTION`

## **🔎 Parameters**

- `configProtoBytes`: TensorFlow `ConfigProto`, serialized into a byte array.
- `classificationCaseSensitive`: Whether classification is case-sensitive. Default is `True`.

## **🎈Model List with Predicted Entities**

| Model Name                                                            |      Predicted Entities            |
|-----------------------------------------------------------------------|-----------------------------|
| [`assertion_bert_classification_radiology`](https://nlp.johnsnowlabs.com/2025/04/28/assertion_bert_classification_radiology_en.html) |  `Confirmed`, `Suspected`, `Negative` |
| [`assertion_bert_classification_jsl`](https://nlp.johnsnowlabs.com/2025/04/28/assertion_bert_classification_jsl_en.html) |  `Present`, `Planned`, `SomeoneElse`, `Past`, `Family`, `Absent`, `Hypothetical`, `Possible` |
| [`assertion_bert_classification_clinical`](https://nlp.johnsnowlabs.com/2025/04/04/assertion_bert_classification_clinical_en.html) |  `absent`, `present`, `conditional`, `associated_with_someone_else`, `hypothetical`, `possible` |
| [`assertion_bert_classifier_jsl_slim`](https://nlp.johnsnowlabs.com/2025/02/13/assertion_bert_classifier_jsl_slim_en.html) |  `present`, `absent`, `possible` |
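
Label sets and their casing differ between models (`Present` versus `present`), so downstream code should not hard-code one spelling. As a rough sketch, the table can be held in a dictionary and queried case-insensitively; `models_predicting` is a hypothetical helper, not part of the library.

```python
# Label sets copied from the model table above.
MODEL_LABELS = {
    "assertion_bert_classification_radiology": ["Confirmed", "Suspected", "Negative"],
    "assertion_bert_classification_jsl": [
        "Present", "Planned", "SomeoneElse", "Past",
        "Family", "Absent", "Hypothetical", "Possible",
    ],
    "assertion_bert_classification_clinical": [
        "absent", "present", "conditional",
        "associated_with_someone_else", "hypothetical", "possible",
    ],
    "assertion_bert_classifier_jsl_slim": ["present", "absent", "possible"],
}

def models_predicting(label):
    """Return model names whose label set contains `label`, ignoring case."""
    wanted = label.lower()
    return [name for name, labels in MODEL_LABELS.items()
            if wanted in (item.lower() for item in labels)]

print(models_predicting("hypothetical"))
```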

## **📍PIPELINE**

In [5]:
document_assembler = DocumentAssembler()\
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setCaseSensitive(False)

ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["PROBLEM"])

assertion_classifier = BertForAssertionClassification.pretrained("assertion_bert_classification_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("assertion_class")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner,
    ner_converter,
    assertion_classifier
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
assertion_bert_classification_clinical download started this may take some time.
[OK!]


In [6]:
text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
IMPRESSION: At this time is refractory anemia, which is transfusion dependent. He is on B12, iron, folic acid, and Procrit. There are no sign or symptom of blood loss and the previous esophagogastroduodenoscopy was negative. His creatinine was 1.
  My impression at this time is that he probably has an underlying myelodysplastic syndrome or bone marrow failure. His creatinine on this hospitalization was up slightly to 1.6 and this may contribute to his anemia.
  At this time, my recommendation for the patient is that he should undergo a bone marrow aspiration.
  I have discussed the procedure in detail which the patient. I have discussed the risks, benefits, and successes of that treatment and usefulness of the bone marrow and predicting his cause of refractory anemia and further therapeutic interventions, which might be beneficial to him.
  He is willing to proceed with the studies I have described to him. We will order an ultrasound of his abdomen because of the possible fullness of the spleen.
  As always, we greatly appreciate being able to participate in the care of your patient. We appreciate the consultation of the patient.
"""

data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)

### **🎊Show result as data frame**

In [7]:
flattener = Flattener() \
    .setInputCols("assertion_class") \
    .setExplodeSelectedFields({"assertion_class":["metadata.ner_chunk as ner_chunk",
                                            "begin as begin",
                                            "end as end",
                                            "metadata.ner_label as ner_label",
                                            "result"]})
pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner,
    ner_converter,
    assertion_classifier,
    flattener
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

result_assertion = model.transform(data)
result_assertion.show(truncate=False)

+--------------------------------------------------------------+-----+----+---------+----------------------+
|ner_chunk                                                     |begin|end |ner_label|assertion_class_result|
+--------------------------------------------------------------+-----+----+---------+----------------------+
|acute distress                                                |43   |56  |PROBLEM  |absent                |
|mild arcus senilis in the right                               |191  |221 |PROBLEM  |present               |
|jugular venous pressure distention                            |380  |413 |PROBLEM  |absent                |
|adenopathy in the cervical, supraclavicular, or axillary areas|428  |489 |PROBLEM  |absent                |
|tender                                                        |514  |519 |PROBLEM  |absent                |
|some fullness in the left upper quadrant                      |535  |574 |PROBLEM  |possible              |
|some edema        
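
The flattened rows can be summarized like any other tabular data. The sketch below uses only the six complete rows visible in the output above, copied by hand, to check that `begin` and `end` are inclusive character offsets and to count chunks per assertion status.

```python
from collections import Counter

# Complete rows visible in the output above: (ner_chunk, begin, end, assertion).
rows = [
    ("acute distress", 43, 56, "absent"),
    ("mild arcus senilis in the right", 191, 221, "present"),
    ("jugular venous pressure distention", 380, 413, "absent"),
    ("adenopathy in the cervical, supraclavicular, or axillary areas", 428, 489, "absent"),
    ("tender", 514, 519, "absent"),
    ("some fullness in the left upper quadrant", 535, 574, "possible"),
]

# begin and end are inclusive, so the span length is end - begin + 1.
for chunk, begin, end, _ in rows:
    assert end - begin + 1 == len(chunk)

# Count how many chunks received each assertion status.
counts = Counter(row[3] for row in rows)
print(counts.most_common())
```

Because the end offset is inclusive, the chunk text is recovered from the input with `text[begin:end + 1]` rather than `text[begin:end]`.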

### **🎉Visualization**

In [8]:
from sparknlp_display import AssertionVisualizer

vis = AssertionVisualizer()

vis.display(result.collect()[0], 'ner_chunk', 'assertion_class')