![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/02.0.Clinical_Assertion_Model.ipynb)

# Clinical Assertion Status Model


The deep neural network architecture for assertion status detection in Spark NLP is based on a Bi-LSTM framework and is a modified version of the architecture proposed by Federico Fancellu, Adam Lopez and Bonnie Webber ([Neural Networks For Negation Scope Detection](https://aclanthology.org/P16-1047.pdf)). Its goal is to classify the assertion made on a given medical concept as present, absent, or possible in the patient; conditionally present in the patient under certain circumstances; hypothetically present in the patient at some future point; or mentioned in the patient report but associated with someone else.

In this implementation, the input units depend on the target tokens (a named entity) and the neighboring words, which are explicitly encoded as a sequence using word embeddings. As in the paper mentioned above, 95% of the scope tokens (neighboring words) were observed to fall within a window of 9 tokens to the left and 15 to the right of the target tokens in the same dataset. The same window size was therefore used, together with the following parameters: learning rate 0.0012, dropout 0.05, batch size 64, and a maximum sentence length of 250. The model is implemented in Spark NLP as an annotator called `AssertionDLModel`. After training for 20 epochs and measuring accuracy on the official test set, this implementation exceeds the latest state-of-the-art accuracy benchmarks on most labels and on micro F1, as summarized in the following table:

|Assertion Label|Spark NLP|Latest Best|
|-|-|-|
|Absent       |0.944 |0.937|
|Someone-else |0.904|0.869|
|Conditional  |0.441|0.422|
|Hypothetical |0.862|0.890|
|Possible     |0.680|0.630|
|Present      |0.953|0.957|
|micro F1     |0.939|0.934|

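The scope window described above can be illustrated with a small pure-Python sketch. It is not the internal implementation of `AssertionDLModel`; the helper name `scope_window` and its signature are hypothetical, and only the 9-left / 15-right window sizes come from the text.

```python
from typing import List, Tuple


def scope_window(tokens: List[str], target_start: int, target_end: int,
                 left: int = 9, right: int = 15) -> Tuple[List[str], List[str], List[str]]:
    """Split a tokenized sentence into (left context, target, right context).

    target_start and target_end are inclusive token indices of the entity.
    The defaults mirror the 9/15 window quoted above; the function itself
    is a hypothetical illustration.
    """
    if not (0 <= target_start <= target_end < len(tokens)):
        raise ValueError("target span must lie inside the sentence")
    left_ctx = tokens[max(0, target_start - left):target_start]
    target = tokens[target_start:target_end + 1]
    right_ctx = tokens[target_end + 1:target_end + 1 + right]
    return left_ctx, target, right_ctx


# Whitespace tokenization is used here only to keep the sketch self-contained.
tokens = "No jugular venous pressure distention is noted .".split()
left_ctx, target, right_ctx = scope_window(tokens, 1, 4)
print(left_ctx, target, right_ctx)
```

In this sentence the negation cue "No" sits inside the left context of the entity, which is the kind of neighboring evidence the window is meant to capture.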

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

**Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
# nlp.install()
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

# Clinical Assertion Models (with pretrained models)

| Index | Model | Entities |
|---:|:------------------------|:-|
|  1 | [assertion_dl](https://nlp.johnsnowlabs.com/2021/01/26/assertion_dl_en.html) | Present, Absent, Possible, conditional, hypothetical, associated_with_someone_else |
|  2 | [assertion_dl_biobert](https://nlp.johnsnowlabs.com/2021/01/26/assertion_dl_biobert_en.html) | Present, Absent, Possible, conditional, hypothetical, associated_with_someone_else |
|  3 | [assertion_dl_healthcare](https://nlp.johnsnowlabs.com/2020/09/23/assertion_dl_healthcare_en.html) | Present, Absent, Possible, conditional, hypothetical, associated_with_someone_else |
|  4 | [assertion_dl_large](https://nlp.johnsnowlabs.com/2020/05/21/assertion_dl_large_en.html) | Present, Absent, Possible, conditional, hypothetical, associated_with_someone_else |
|  5 | [assertion_dl_radiology](https://nlp.johnsnowlabs.com/2021/03/18/assertion_dl_radiology_en.html) | Confirmed, Suspected, Negative |
|  6 | [assertion_jsl](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_en.html) | Present, Absent, Possible, Planned, Someoneelse, Past, Family, Hypotetical |
|  7 | [assertion_jsl_large](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_large_en.html) | present, absent, possible, planned, someoneelse, past, hypothetical |
|  8 | [assertion_ml](https://nlp.johnsnowlabs.com/2020/01/30/assertion_ml_en.html) | Hypothetical, Present, Absent, Possible, Conditional, Associated_with_someone_else |
|  9 | [assertion_dl_scope_L10R10](https://nlp.johnsnowlabs.com/2022/03/17/assertion_dl_scope_L10R10_en_3_0.html) | hypothetical, associated_with_someone_else, conditional, possible, absent, present |
| 10 | [assertion_dl_biobert_scope_L10R10](https://nlp.johnsnowlabs.com/2022/03/24/assertion_dl_biobert_scope_L10R10_en_2_4.html) | hypothetical, associated_with_someone_else, conditional, possible, absent, present |
| 11 | [assertion_jsl_augmented](https://nlp.johnsnowlabs.com/2022/09/15/assertion_jsl_augmented_en.html) | Present, Absent, Possible, Planned, Past, Family, Hypothetical, SomeoneElse |
| 12 | [assertion_bert_classification_clinical_onnx](https://nlp.johnsnowlabs.com/2025/07/15/assertion_bert_classification_clinical_onnx_en.html) | Present, Past, Family, Absent, Hypothetical, Possible |
| 13 | [assertion_bert_classification_jsl_onnx](https://nlp.johnsnowlabs.com/2025/07/15/assertion_bert_classification_jsl_onnx_en.html) | Present, Planned, SomeoneElse, Past, Family, Absent, Hypothetical, Possible |
| 14 | [assertion_bert_classification_oncology_onnx](https://nlp.johnsnowlabs.com/2025/07/15/assertion_bert_classification_oncology_onnx_en.html) | Present, Past, Family, Absent, Hypothetical, Possible |
| 15 | [assertion_bert_classification_radiology_onnx](https://nlp.johnsnowlabs.com/2025/07/15/assertion_bert_classification_radiology_onnx_en.html) | Confirmed, Suspected, Negative |
| 16 | [assertion_bert_classification_radiology](https://nlp.johnsnowlabs.com/2025/04/28/assertion_bert_classification_radiology_en.html) | Present, Absent, Conditional, Associated_with_someone_else, Hypothetical, Possible |
| 17 | [assertion_bert_classification_jsl](https://nlp.johnsnowlabs.com/2025/04/28/assertion_bert_classification_jsl_en.html) | Present, Planned, SomeoneElse, Past, Family, Absent, Hypothetical, Possible |
| 18 | [assertion_bert_classification_clinical](https://nlp.johnsnowlabs.com/2025/04/04/assertion_bert_classification_clinical_en.html) | absent, present, conditional, associated_with_someone_else, hypothetical, possible |
| 19 | [contextual_assertion_conditional](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_conditional_en.html) | conditional |
| 20 | [contextual_assertion_possible](https://nlp.johnsnowlabs.com/2025/03/12/contextual_assertion_possible_en.html) | possible |
| 21 | [assertion_genomic_abnormality_wip](https://nlp.johnsnowlabs.com/2025/01/16/assertion_genomic_abnormality_wip_en.html) | Normal, Affected, Variant |


### Pretrained `assertion_jsl_augmented` model

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    # .setIncludeAllConfidenceScores(False)

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])

# Assertion model trained on i2b2 (sampled from MIMIC) dataset
clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        clinical_assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
assertion_jsl_augmented download started this may take some time.
Approximate size to download 6.2 MB
[OK!]


In [None]:
medical.AssertionDLApproach().extractParamMap()

{Param(parent='AssertionDLApproach_d28daf6b6e04', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='AssertionDLApproach_d28daf6b6e04', name='label', doc='Column with one label per document'): 'label',
 Param(parent='AssertionDLApproach_d28daf6b6e04', name='batchSize', doc='Size for each batch in the optimization process'): 64,
 Param(parent='AssertionDLApproach_d28daf6b6e04', name='epochs', doc='Number of epochs for the optimization process'): 5,
 Param(parent='AssertionDLApproach_d28daf6b6e04', name='learningRate', doc='Learning rate for the optimization process'): 0.0012,
 Param(parent='AssertionDLApproach_d28daf6b6e04', name='dropout', doc='Dropout at the output of each layer'): 0.05,
 Param(parent='AssertionDLApproach_d28daf6b6e04', name='maxSentLen', doc='Max length for an input sentence.'): 250,
 Param(parent='AssertionDLApproach_d28daf6b6e04', name='includeConfidence', doc='whether to include confidence scores in a

In [None]:
# A logistic-regression-based assertion model is also available.
'''
clinical_assertion_ml = medical.AssertionLogRegModel.pretrained("assertion_ml", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
'''

In [None]:
text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
IMPRESSION: At this time is refractory anemia, which is transfusion dependent. He is on B12, iron, folic acid, and Procrit. There are no sign or symptom of blood loss and the previous esophagogastroduodenoscopy was negative. His creatinine was 1.
  My impression at this time is that he probably has an underlying myelodysplastic syndrome or bone marrow failure. His creatinine on this hospitalization was up slightly to 1.6 and this may contribute to his anemia.
  At this time, my recommendation for the patient is that he should undergo a bone marrow aspiration.
  I have discussed the procedure in detail which the patient. I have discussed the risks, benefits, and successes of that treatment and usefulness of the bone marrow and predicting his cause of refractory anemia and further therapeutic interventions, which might be beneficial to him.
  He is willing to proceed with the studies I have described to him. We will order an ultrasound of his abdomen because of the possible fullness of the spleen.
  As always, we greatly appreciate being able to participate in the care of your patient. We appreciate the consultation of the patient.
"""

In [None]:
light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

chunks=[]
entities=[]
status=[]
confidence=[]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):

    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
    confidence.append(m.metadata['confidence'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

Unnamed: 0,chunks,entities,assertion,confidence
0,distress,Symptom,Absent,0.9999
1,arcus senilis,Disease_Syndrome_Disorder,Past,1.0
2,jugular venous pressure distention,Symptom,Absent,1.0
3,adenopathy,Symptom,Absent,1.0
4,tender,Symptom,Absent,1.0
5,fullness,Symptom,Possible,1.0
6,edema,Symptom,Present,1.0
7,cyanosis,VS_Finding,Absent,1.0
8,clubbing,Symptom,Absent,1.0
9,anemia,Disease_Syndrome_Disorder,Hypothetical,0.9758


In [None]:
vis = nlp.viz.AssertionVisualizer()

vis.display(light_result, 'ner_chunk', 'assertion')

### PipelineTracer and PipelineOutputParser
####  Automating Pipeline Tracing and Analysis with `PipelineTracer` to Help Return Structured JSONs from Pretrained Pipelines via the `PipelineOutputParser` Module

- `PipelineTracer` is a flexible class that tracks every stage of a pipeline. It provides detailed information about entities, assertions, de-identification, classification, and relationships, and it helps build the parser dictionaries needed to create a `PipelineOutputParser`. Its central functionality includes printing the pipeline schema, creating parser dictionaries, and retrieving the possible assertions, relationships, and entities. It also provides easy access to parser dictionaries and existing pipeline diagrams. Please see the [PipelineTracer and PipelineOutputParser Notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.4.PipelineTracer_and_PipelineOutputParser.ipynb) for more details.

In [None]:
pipeline_tracer = medical.PipelineTracer(model)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "assertion_jsl_augmented"})
column_maps

{'document_identifier': 'assertion_jsl_augmented',
 'document_text': 'document',
 'entities': ['ner_chunk'],
 'assertions': ['assertion'],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': [],
 'mappers': []}

In [None]:
print("Entities: " , pipeline_tracer.getPossibleEntities())

print("Assertions: ",  pipeline_tracer.getPossibleAssertions())

Entities:  ['Admission_Discharge', 'VS_Finding', 'Symptom', 'Disease_Syndrome_Disorder', 'Procedure']
Assertions:  ['Family', 'Past', 'Hypothetical', 'Possible', 'SomeoneElse', 'Planned', 'Absent', 'Present']
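
The list of possible assertions is useful as a sanity check on downstream code, because label spelling and casing differ between the pretrained models listed earlier (for example `Present` versus `present`). A minimal sketch, assuming the labels printed above and a hypothetical helper `unexpected_labels`:

```python
# Labels as printed by pipeline_tracer.getPossibleAssertions() above.
possible_assertions = ['Family', 'Past', 'Hypothetical', 'Possible',
                       'SomeoneElse', 'Planned', 'Absent', 'Present']

# Assertion statuses observed in the sample note processed earlier.
observed = ['Absent', 'Past', 'Absent', 'Absent', 'Absent',
            'Possible', 'Present', 'Absent', 'Absent', 'Hypothetical']


def unexpected_labels(observed, allowed):
    """Return the sorted, de-duplicated labels that are not in the allowed set.

    The comparison is case-sensitive on purpose, so 'present' is flagged
    when the model emits 'Present'.
    """
    allowed_set = set(allowed)
    return sorted({label for label in observed if label not in allowed_set})


print(unexpected_labels(observed, possible_assertions))
print(unexpected_labels(['present', 'Absent'], possible_assertions))
```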


In [None]:
light_model = nlp.LightPipeline(model)
light_result = light_model.fullAnnotate(text)

In [None]:
pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(light_result)

result['result'][0]

{'document_identifier': 'assertion_jsl_augmented',
 'document_id': 0,
 'document_text': ['\nGENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.\nHEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.\nNECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.\nABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.\nEXTREMITIES: There is some edema, but no cyanosis and clubbing .\nIMPRESSION: At this time is refractory anemia, which is transfusion dependent. He is on B12, iron, folic acid, and Procrit. There are no sign or symptom of blood loss and the previous esophagogastroduo

In [None]:
entities_df = pd.DataFrame.from_dict(result["result"][0]["entities"])
entities_df

Unnamed: 0,chunk_id,chunk,begin,end,ner_label,ner_source,ner_confidence
0,68e2305b,distress,49,56,Symptom,ner_chunk,0.9441
1,18de06c2,arcus senilis,196,208,Disease_Syndrome_Disorder,ner_chunk,0.43245
2,6100d87d,jugular venous pressure distention,380,413,Symptom,ner_chunk,0.45412502
3,dd7b2694,adenopathy,428,437,Symptom,ner_chunk,0.9938
4,c6f560b3,tender,514,519,Symptom,ner_chunk,0.9851
5,b3ef7e62,fullness,540,547,Symptom,ner_chunk,0.9096
6,3f80f545,edema,665,669,Symptom,ner_chunk,0.9807
7,5b55524b,cyanosis,679,686,VS_Finding,ner_chunk,0.9196
8,88599138,clubbing,692,699,Symptom,ner_chunk,0.9959
9,10a57d45,anemia,742,747,Disease_Syndrome_Disorder,ner_chunk,0.9904


In [None]:
assertion_df = pd.DataFrame.from_dict(result["result"][0]["assertions"])
assertion_df

Unnamed: 0,chunk_id,chunk,assertion,assertion_confidence,assertion_source
0,68e2305b,distress,Absent,0.9999,assertion
1,18de06c2,arcus senilis,Past,1.0,assertion
2,6100d87d,jugular venous pressure distention,Absent,1.0,assertion
3,dd7b2694,adenopathy,Absent,1.0,assertion
4,c6f560b3,tender,Absent,1.0,assertion
5,b3ef7e62,fullness,Possible,1.0,assertion
6,3f80f545,edema,Present,1.0,assertion
7,5b55524b,cyanosis,Absent,1.0,assertion
8,88599138,clubbing,Absent,1.0,assertion
9,10a57d45,anemia,Hypothetical,0.9758,assertion


In [None]:
merged_df = pd.merge(entities_df, assertion_df,  on=['chunk_id', 'chunk']).drop(columns='chunk_id')
merged_df

Unnamed: 0,chunk,begin,end,ner_label,ner_source,ner_confidence,assertion,assertion_confidence,assertion_source
0,distress,49,56,Symptom,ner_chunk,0.9441,Absent,0.9999,assertion
1,arcus senilis,196,208,Disease_Syndrome_Disorder,ner_chunk,0.43245,Past,1.0,assertion
2,jugular venous pressure distention,380,413,Symptom,ner_chunk,0.45412502,Absent,1.0,assertion
3,adenopathy,428,437,Symptom,ner_chunk,0.9938,Absent,1.0,assertion
4,tender,514,519,Symptom,ner_chunk,0.9851,Absent,1.0,assertion
5,fullness,540,547,Symptom,ner_chunk,0.9096,Possible,1.0,assertion
6,edema,665,669,Symptom,ner_chunk,0.9807,Present,1.0,assertion
7,cyanosis,679,686,VS_Finding,ner_chunk,0.9196,Absent,1.0,assertion
8,clubbing,692,699,Symptom,ner_chunk,0.9959,Absent,1.0,assertion
9,anemia,742,747,Disease_Syndrome_Disorder,ner_chunk,0.9904,Hypothetical,0.9758,assertion
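
The merge above is an inner join on `chunk_id` and `chunk`. The same logic can be sketched in plain Python to make the matching explicit; `inner_join` is a hypothetical helper, and the rows are a three-row subset of the tables above with most columns omitted for brevity.

```python
entities = [
    {"chunk_id": "68e2305b", "chunk": "distress", "ner_label": "Symptom"},
    {"chunk_id": "3f80f545", "chunk": "edema", "ner_label": "Symptom"},
    {"chunk_id": "10a57d45", "chunk": "anemia", "ner_label": "Disease_Syndrome_Disorder"},
]
assertions = [
    {"chunk_id": "68e2305b", "chunk": "distress", "assertion": "Absent"},
    {"chunk_id": "3f80f545", "chunk": "edema", "assertion": "Present"},
    {"chunk_id": "10a57d45", "chunk": "anemia", "assertion": "Hypothetical"},
]


def inner_join(left, right, keys):
    """Join two lists of dicts on the given key columns (inner join)."""
    # Index the right-hand rows by their key tuple for constant-time lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


merged = inner_join(entities, assertions, keys=("chunk_id", "chunk"))
for row in merged:
    row.pop("chunk_id")  # mirrors .drop(columns='chunk_id') in the cell above
    print(row)
```

Joining on `chunk_id` rather than on the chunk text alone matters when the same surface form (for example "anemia") appears more than once in a document.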


In [None]:
# Downloading sample datasets.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mt_samples_10.csv

In [None]:
mt_samples_df = spark.createDataFrame(pd.read_csv("/content/mt_samples_10.csv", sep=',', index_col=["index"]).reset_index())

mt_samples_df.printSchema()

root
 |-- index: long (nullable = true)
 |-- text: string (nullable = true)



In [None]:
mt_samples_df.show()

+-----+--------------------+
|index|                text|
+-----+--------------------+
|    0|Sample Type / Med...|
|    1|Sample Type / Med...|
|    2|Sample Type / Med...|
|    3|Sample Type / Med...|
|    4|Sample Type / Med...|
|    5|Sample Type / Med...|
|    6|Sample Type / Med...|
|    7|Sample Type / Med...|
|    8|Sample Type / Med...|
|    9|Sample Type / Med...|
+-----+--------------------+



In [None]:
result = model.transform(mt_samples_df)

In [None]:
result.show()

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|index|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|           assertion|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    0|Sample Type / Med...|[{document, 0, 54...|[{document, 0, 24...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 68, 76, ...|[{assertion, 68, ...|
|    1|Sample Type / Med...|[{document, 0, 32...|[{document, 0, 26...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 68, 92, ...|[{assertion, 68, ...|
|    2|Sample Type / Med...|[{document, 0, 42...|[{document, 0, 14...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_

In [None]:
result.select('sentence.result').take(1)

[Row(result=['Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1\nDescription:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.', '(Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS:\nMesothelioma.', 'SECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.', 'PROCEDURES', '1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.', '2. On August 20, 2007, thoracentesis.', '3. On August 31, 2007, Port-A-Cath placement.', 'HISTORY AND PHYSICAL:\nThe patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week.', 'She has had right-sided chest pain radiating to her back with fever starting yesterday.', 'She has a history of pericarditis and pericardectomy in May 2006 and developed cough with righ

In [None]:
import pyspark.sql.functions as F

result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata,
                                     result.assertion.result,
                                     result.assertion.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['sentence']").alias("sent_id"),
              F.expr("cols['4']").alias("assertion"),
              F.expr("cols['5']['confidence']").alias("confidence") ).show(truncate=False)


+-------------------------+-----+---+-------------------------+-------+------------+----------+
|chunk                    |begin|end|ner_label                |sent_id|assertion   |confidence|
+-------------------------+-----+---+-------------------------+-------+------------+----------+
|Discharge                |68   |76 |Admission_Discharge      |0      |Past        |0.991     |
|pleural effusion         |132  |147|Disease_Syndrome_Disorder|0      |Present     |1.0       |
|anemia                   |171  |176|Disease_Syndrome_Disorder|0      |Family      |1.0       |
|ascites                  |179  |185|Disease_Syndrome_Disorder|0      |Hypothetical|0.9782    |
|esophageal reflux        |188  |204|Disease_Syndrome_Disorder|0      |Family      |1.0       |
|deep venous thrombosis   |222  |243|Disease_Syndrome_Disorder|0      |Family      |1.0       |
|Pleural effusion         |340  |355|Disease_Syndrome_Disorder|2      |Present     |1.0       |
|anemia                   |379  |384|Dis
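
The `arrays_zip` plus `explode` pattern used above turns several parallel arrays in one row into one output row per array position. A plain-Python sketch of that behavior for a single input row follows; `zip_and_explode` is a hypothetical helper, and the sample values are taken from the first rows of the output above. The pattern relies on the `ner_chunk` and `assertion` arrays being aligned by position.

```python
def zip_and_explode(row, columns):
    """Mimic arrays_zip followed by explode for one input row.

    Shorter arrays are padded with None, as arrays_zip pads with nulls,
    and an empty input produces no output rows, as with explode.
    """
    arrays = [row[c] for c in columns]
    length = max((len(a) for a in arrays), default=0)
    out = []
    for i in range(length):
        out.append({c: (a[i] if i < len(a) else None)
                    for c, a in zip(columns, arrays)})
    return out


# One document with three chunks, stored as parallel arrays.
doc = {
    "chunk": ["Discharge", "pleural effusion", "anemia"],
    "begin": [68, 132, 171],
    "assertion": ["Past", "Present", "Family"],
}

exploded = zip_and_explode(doc, ["chunk", "begin", "assertion"])
for record in exploded:
    print(record)
```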

### Pretrained `assertion_dl_radiology` model

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model for radiology
radiology_ner = medical.NerModel.pretrained("ner_radiology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    # .setIncludeAllConfidenceScores(False)

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["ImagingFindings"])

# Assertion model trained on radiology dataset
radiology_assertion = medical.AssertionDLModel.pretrained("assertion_dl_radiology", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        radiology_ner,
        ner_converter,
        radiology_assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
radiologyAssertion_model = nlpPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_radiology download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_dl_radiology download started this may take some time.
Approximate size to download 2.4 MB
[OK!]


In [None]:
radiology_assertion.getClasses()

['Confirmed', 'Suspected', 'Negative']

In [None]:
# A sample text from a radiology report

text = """No right-sided pleural effusion or pneumothorax is definitively seen and there are mildly displaced fractures of the left lateral 8th and likely 9th ribs."""

In [None]:
data = spark.createDataFrame([[text]]).toDF("text")

In [None]:
result = radiologyAssertion_model.transform(data)

In [None]:
import pyspark.sql.functions as F

result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata,
                                     result.assertion.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['sentence']").alias("sent_id"),
              F.expr("cols['2']").alias("assertion")).show(truncate=False)

+-------------------+---------------+-------+---------+
|chunk              |ner_label      |sent_id|assertion|
+-------------------+---------------+-------+---------+
|effusion           |ImagingFindings|0      |Negative |
|pneumothorax       |ImagingFindings|0      |Negative |
|displaced fractures|ImagingFindings|0      |Confirmed|
+-------------------+---------------+-------+---------+



## Writing a generic Assertion + NER function

In [None]:
from pyspark.sql import functions as F

In [None]:
def get_base_pipeline(embeddings='embeddings_clinical'):

    documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    # Sentence Detector annotator, processes various sentences per line
    sentenceDetector = nlp.SentenceDetector()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    # Tokenizer splits words in a relevant format for NLP
    tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    # Clinical word embeddings trained on PubMED dataset
    word_embeddings = nlp.WordEmbeddingsModel.pretrained(embeddings, "en", "clinical/models")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

    base_pipeline = nlp.Pipeline(
        stages=[
            documentAssembler,
            sentenceDetector,
            tokenizer,
            word_embeddings
    ])

    return base_pipeline



def get_clinical_assertion(embeddings, spark_df, nrows=100, ner_model_name='ner_clinical', assertion_model_name="assertion_dl"):

    # NER model trained on i2b2 (sampled from MIMIC) dataset
    loaded_ner_model = medical.NerModel.pretrained(ner_model_name, "en", "clinical/models") \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("ner")

    ner_converter = medical.NerConverterInternal() \
        .setInputCols(["sentence", "token", "ner"]) \
        .setOutputCol("ner_chunk")

    # Assertion model trained on i2b2 (sampled from MIMIC) dataset
    # (AssertionDLModel comes from sparknlp_jsl.annotator)
    clinical_assertion = medical.AssertionDLModel.pretrained(assertion_model_name, "en", "clinical/models") \
        .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
        .setOutputCol("assertion")


    base_model = get_base_pipeline (embeddings)

    nlpPipeline = nlp.Pipeline(
        stages=[
            base_model,
            loaded_ner_model,
            ner_converter,
            clinical_assertion
    ])

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)

    result = model.transform(spark_df.limit(nrows))

    result = result.withColumn("id", F.monotonically_increasing_id())

    result_df = result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                                     result.ner_chunk.metadata,
                                                     result.assertion.result,
                                                     result.assertion.metadata)).alias("cols")) \
                      .select(F.expr("cols['0']").alias("chunk"),
                              F.expr("cols['1']['entity']").alias("ner_label"),
                              F.expr("cols['2']").alias("assertion"),
                              F.expr("cols['3']['confidence']").alias("confidence"))\
                      .filter("ner_label!='O'")

    return result_df

In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_clinical_large'

nrows = 100

ner_df = get_clinical_assertion (embeddings, mt_samples_df, nrows, ner_model_name)

ner_df.show(30,truncate=50)

ner_clinical_large download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+----------------------------+---------+---------+----------+
|                       chunk|ner_label|assertion|confidence|
+----------------------------+---------+---------+----------+
|                Mesothelioma|  PROBLEM|  present|    0.9996|
|                Mesothelioma|  PROBLEM|  present|    0.9996|
|            pleural effusion|  PROBLEM|  present|    0.9998|
|         atrial fibrillation|  PROBLEM|  present|       1.0|
|                      anemia|  PROBLEM|  present|    0.9999|
|                     ascites|  PROBLEM|  present|    0.9999|
|           esophageal reflux|  PROBLEM|  present|    0.9999|
|      deep venous thrombosis|  PROBLEM|  present|    0.8533|
|            

In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_posology'

nrows = 100

ner_df = get_clinical_assertion (embeddings, mt_samples_df, nrows, ner_model_name)

ner_df.show()

ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+----------------+---------+------------+----------+
|           chunk|ner_label|   assertion|confidence|
+----------------+---------+------------+----------+
|        Coumadin|     DRUG|hypothetical|    0.8709|
|            1 mg| STRENGTH| conditional|    0.7772|
|           daily|FREQUENCY| conditional|    0.5086|
|      Amiodarone|     DRUG|hypothetical|    0.8589|
|          100 mg| STRENGTH|hypothetical|    0.6143|
|             p.o|    ROUTE|hypothetical|    0.7991|
|           daily|FREQUENCY|     present|    0.9074|
|        Coumadin|     DRUG|     present|    0.9999|
|         Lovenox|     DRUG|     present|    0.9994|
|           40 mg| STRENGTH|     present|    0.9982|
|  subcutane

In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_posology_greedy'

entry_data = spark.createDataFrame([["The patient did not take a capsule of Advil."]]).toDF("text")

ner_df = get_clinical_assertion (embeddings, entry_data, nrows, ner_model_name)

ner_df.show()

ner_posology_greedy download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+----------------+---------+---------+----------+
|           chunk|ner_label|assertion|confidence|
+----------------+---------+---------+----------+
|capsule of Advil|     DRUG|   absent|    0.9855|
+----------------+---------+---------+----------+



In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_clinical'

entry_data = spark.createDataFrame([["The patient has no fever"]]).toDF("text")

ner_df = get_clinical_assertion (embeddings, entry_data, nrows, ner_model_name)

ner_df.show()

ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+-----+---------+---------+----------+
|chunk|ner_label|assertion|confidence|
+-----+---------+---------+----------+
|fever|  PROBLEM|   absent|     0.998|
+-----+---------+---------+----------+



In [None]:
clinical_text = """
Patient with severe fever and sore throat.
He shows no stomach pain and he maintained on an epidural and PCA for pain control.
He also became short of breath with climbing a flight of stairs.
After CT, lung tumor located at the right lower lobe. Father with Alzheimer.
"""
light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(clinical_text)

In [None]:
column_maps = {
    'document_identifier': 'ner_clinical_pipeline',
    'document_text': 'document',
    'entities': ['ner_chunk'],
    'assertions': ['assertion']
}

pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(light_result) #light_result is defined above

assertions_df = pd.DataFrame(result['result'][0]['assertions'])
entities_df = pd.DataFrame(result['result'][0]['entities'])

merged_df = pd.merge(entities_df, assertions_df,  on=['chunk_id', 'chunk']).drop(columns='chunk_id')

merged_df

Unnamed: 0,chunk,begin,end,ner_label,ner_source,ner_confidence,assertion,assertion_confidence,assertion_source
0,fever,21,25,VS_Finding,ner_chunk,0.9943,Present,1.0,assertion
1,sore throat,31,41,Symptom,ner_chunk,0.69635,Present,1.0,assertion
2,stomach pain,56,67,Symptom,ner_chunk,0.85885,Absent,1.0,assertion
3,pain,114,117,Symptom,ner_chunk,0.9864,Hypothetical,1.0,assertion
4,short of breath,143,157,Symptom,ner_chunk,0.6305,Present,1.0,assertion
5,climbing a flight of stairs,164,190,Symptom,ner_chunk,0.54858005,Present,0.9434,assertion
6,Alzheimer,259,267,Disease_Syndrome_Disorder,ner_chunk,0.9796,Family,0.8136,assertion


# Replace Assertion Labels

In [None]:
# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

# Assertion model trained on i2b2 (sampled from MIMIC) dataset
clinical_replaced_assertion = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("replaced_assertion") \
    .setIncludeConfidence(True) \
    .setReplaceLabels({"Present":"Exist",
                       "Absent": "None",
                       "Conditional": "Possible",
                       "Hypothetical": "Possible"})

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_replaced_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

ner_clinical_large download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_jsl_augmented download started this may take some time.
Approximate size to download 6.2 MB
[OK!]


In [None]:
text = """
Patient with severe fever and sore throat.
He shows no stomach pain and he maintained on an epidural and PCA for pain control.
He also became short of breath with climbing a flight of stairs.
After CT, lung tumor located at the right lower lobe. Father with Alzheimer.
"""

light_model = nlp.LightPipeline(model)
light_result = light_model.fullAnnotate(text)

chunks=[]
entities=[]
confidence=[]
status=[]

for assertion_row in light_result[0]["replaced_assertion"]:
  chunk_id = assertion_row.metadata["chunk"]
  for chunk_row in light_result[0]["ner_chunk"]:
    if chunk_id == chunk_row.metadata["chunk"]:
        chunks.append(chunk_row.result)
        entities.append(chunk_row.metadata['entity'])
        status.append(assertion_row.result)
        confidence.append(assertion_row.metadata['confidence'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'replaced_assertion':status, 'confidence':confidence})
df     # "Present" is replaced with "Exist", "Absent" with "None", and "Conditional"/"Hypothetical" with "Possible"

Unnamed: 0,chunks,entities,replaced_assertion,confidence
0,severe fever,PROBLEM,Exist,1.0
1,sore throat,PROBLEM,Exist,1.0
2,stomach pain,PROBLEM,,1.0
3,an epidural,TREATMENT,Exist,1.0
4,PCA,TREATMENT,Past,0.9978
5,pain control,PROBLEM,Possible,1.0
6,short of breath,PROBLEM,Exist,1.0
7,CT,TEST,Past,0.9963
8,lung tumor,PROBLEM,Exist,1.0
9,Alzheimer,PROBLEM,Family,0.8136
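
Judging by the output above, labels that are not listed in the replacement dictionary (`Past`, `Family`) pass through unchanged. The plain-Python sketch below illustrates that mapping idea only; `replace_label` is a hypothetical helper written for this example and is not part of Spark NLP.

```python
# Illustrative only: mimics the mapping idea behind setReplaceLabels in plain Python.
# The dictionary is the one passed to setReplaceLabels in the cell above.
REPLACE_LABELS = {
    "Present": "Exist",
    "Absent": "None",
    "Conditional": "Possible",
    "Hypothetical": "Possible",
}


def replace_label(label, mapping=REPLACE_LABELS):
    """Return the replacement for `label`, or `label` itself if it is not mapped."""
    return mapping.get(label, label)


original = ["Present", "Absent", "Past", "Hypothetical", "Family"]
replaced = [replace_label(label) for label in original]
print(replaced)
```

Because unmapped labels are left alone, the dictionary only needs to list the labels you actually want to rename.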


# Entity Type Constraints


You can constrain assertions by entity type using a dictionary of the form `{"entity": [assertion_label1, assertion_label2, .. assertion_labelN]}`. When an entity is not found in the dictionary, no constraints are applied to it. The `setEntityAssertionCaseSensitive` parameter of `AssertionDLModel` controls case sensitivity for both entities and assertion labels.
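
To make the dictionary semantics concrete, here is a minimal plain-Python sketch. It only checks whether an (entity, assertion) pair is permitted by a constraint dictionary; what the model does with a disallowed label is not described here, so the sketch does not try to model it. The helper `is_assertion_allowed` and the example constraints are hypothetical, not Spark NLP API.

```python
def is_assertion_allowed(entity, assertion, constraints, case_sensitive=True):
    """Check an (entity, assertion) pair against a constraint dictionary.

    Entities missing from `constraints` are unconstrained, so any label is allowed.
    """
    if not case_sensitive:
        # Normalize both sides so that comparison ignores case.
        entity = entity.lower()
        assertion = assertion.lower()
        constraints = {
            key.lower(): [value.lower() for value in values]
            for key, values in constraints.items()
        }
    if entity not in constraints:
        return True
    return assertion in constraints[entity]


# Hypothetical constraints, chosen only for illustration.
constraints = {"PROBLEM": ["Present", "Absent"], "TEST": ["Past"]}

checks = [
    is_assertion_allowed("PROBLEM", "Present", constraints),    # listed label
    is_assertion_allowed("PROBLEM", "Past", constraints),       # label not listed
    is_assertion_allowed("TREATMENT", "Past", constraints),     # entity not in dict
    is_assertion_allowed("PROBLEM", "present", constraints),    # case mismatch
    is_assertion_allowed("PROBLEM", "present", constraints, case_sensitive=False),
]
print(checks)
```

The last two checks show why the case-sensitivity switch matters: the same pair is rejected under case-sensitive matching and accepted once case is ignored.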

In [None]:
# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

# Assertion model trained on i2b2 (sampled from MIMIC) dataset
clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")\
    .setEntityAssertionCaseSensitive(False)


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        clinical_assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

ner_clinical_large download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_jsl_augmented download started this may take some time.
Approximate size to download 6.2 MB
[OK!]


In [None]:
text = '''
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl,  creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27.
'''
light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)

In [None]:
pipeline_tracer = medical.PipelineTracer(model)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "assertion_jsl_replaced_label"})

pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(light_result)

result['result'][0].keys()

dict_keys(['document_identifier', 'document_id', 'document_text', 'entities', 'assertions', 'resolutions', 'relations', 'summaries', 'deidentifications', 'classifications'])

In [None]:
assertions_df = pd.DataFrame(result['result'][0]['assertions'])
entities_df = pd.DataFrame(result['result'][0]['entities'])

merged_df = pd.merge(entities_df, assertions_df,  on=['chunk_id', 'chunk']).drop(columns='chunk_id')

merged_df

Unnamed: 0,chunk,begin,end,ner_label,ner_source,ner_confidence,assertion,assertion_confidence,assertion_source
0,gestational diabetes mellitus,40,68,PROBLEM,ner_chunk,0.91976666,SomeoneElse,0.7177,assertion
1,subsequent type two diabetes mellitus,118,154,PROBLEM,ner_chunk,0.75924003,Exist,0.9912,assertion
2,T2DM,157,160,PROBLEM,ner_chunk,0.9917,SomeoneElse,0.7031,assertion
3,HTG-induced pancreatitis,185,208,PROBLEM,ner_chunk,0.97535,Past,0.9214,assertion
4,an acute hepatitis,265,282,PROBLEM,ner_chunk,0.9440667,Exist,0.8313,assertion
5,polyuria,322,329,PROBLEM,ner_chunk,0.9728,Family,1.0,assertion
6,poor appetite,332,344,PROBLEM,ner_chunk,0.9934,Family,1.0,assertion
7,vomiting,351,358,PROBLEM,ner_chunk,0.9854,Family,1.0,assertion
8,metformin,372,380,TREATMENT,ner_chunk,0.9998,Exist,0.5364,assertion
9,glipizide,383,391,TREATMENT,ner_chunk,0.9999,Exist,0.9993,assertion


# Assertion Filterer
`AssertionFilterer` filters named entities by their assertion status. Use `setWhiteList()` to keep only the entities whose assertion status is in the list, or `setBlackList()` to exclude the entities whose status is in the list. For example, whitelist `present` and `conditional` to keep only those statuses, or blacklist `absent` if you do not want absent conditions to leave your pipeline.
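
The whitelist and blacklist logic can be sketched in plain Python as follows. This is an illustration of the filtering idea, not the annotator's implementation; `filter_chunks` is a hypothetical helper and the sample chunk and status pairs are illustrative.

```python
def filter_chunks(chunks, white_list=None, black_list=None, case_sensitive=True):
    """Keep chunk texts whose assertion status passes the white/black lists."""
    normalize = (lambda s: s) if case_sensitive else str.lower
    allowed = {normalize(x) for x in white_list} if white_list else None
    blocked = {normalize(x) for x in black_list} if black_list else set()

    kept = []
    for chunk, status in chunks:
        status = normalize(status)
        if allowed is not None and status not in allowed:
            continue  # not whitelisted
        if status in blocked:
            continue  # explicitly blacklisted
        kept.append(chunk)
    return kept


# Illustrative (chunk, assertion status) pairs.
sample = [
    ("a headache", "Present"),
    ("a head CT", "Hypothetical"),
    ("anxious", "Possible"),
    ("Alopecia", "Present"),
    ("pain", "Absent"),
]

kept_white = filter_chunks(sample, white_list=["present"], case_sensitive=False)
kept_black = filter_chunks(sample, black_list=["absent"], case_sensitive=False)
print(kept_white)
print(kept_black)
```

A whitelist is the stricter choice, since every status you did not name is dropped; a blacklist removes only the statuses you name.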

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    # .setIncludeAllConfidenceScores(False)

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["PROBLEM", "TEST","TREATMENT"])

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])
# or .setBlackList(["Absent"])

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        clinical_assertion,
        assertion_filterer
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
assertionFilter_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_jsl download started this may take some time.
Approximate size to download 1.4 MB
[OK!]


In [None]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain.'

light_model = nlp.LightPipeline(assertionFilter_model)
light_result = light_model.fullAnnotate(text)

light_result[0].keys()

dict_keys(['assertion_filtered', 'document', 'ner_chunk', 'assertion', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
assertion_filterer.getWhiteList()

['Present']

In [None]:
column_maps = {
    'document_identifier': 'assertion_filterer_pipeline',
    'document_text': 'document',
    'entities': ['assertion_filtered'],
    'assertions': ['assertion'],
    'resolutions': [],
    'relations': [],
    'summaries': [],
    'deidentifications': [],
    'classifications': []
}

pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(light_result ) #light_result is defined above

assertions_df = pd.DataFrame(result['result'][0]['assertions'])
assertions_df

Unnamed: 0,chunk_id,chunk,assertion,assertion_confidence,assertion_source
0,e926d14d,a headache,Present,,assertion
1,0bf1e911,a head CT,Hypothetical,,assertion
2,d7134a98,anxious,Possible,,assertion
3,bb8100fe,Alopecia,Present,,assertion
4,cb5d4145,pain,Absent,,assertion


In [None]:
entities_df = pd.DataFrame(result['result'][0]['entities'])
entities_df

Unnamed: 0,chunk_id,chunk,begin,end,ner_label,ner_source,ner_confidence
0,e926d14d,a headache,12,21,PROBLEM,ner_chunk,0.97150004
1,bb8100fe,Alopecia,110,117,PROBLEM,ner_chunk,0.9949


In [None]:
assertions_df = pd.DataFrame(result['result'][0]['assertions'])

merged_df = pd.merge(entities_df, assertions_df,  on=['chunk_id', 'chunk']).drop(columns='chunk_id')

merged_df

Unnamed: 0,chunk,begin,end,ner_label,ner_source,ner_confidence,assertion,assertion_confidence,assertion_source
0,a headache,12,21,PROBLEM,ner_chunk,0.97150004,Present,,assertion
1,Alopecia,110,117,PROBLEM,ner_chunk,0.9949,Present,,assertion


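As you can see, the filtered output has no "pain" chunk, since its assertion label is "Absent". The chunks "a head CT" (Hypothetical) and "anxious" (Possible) are dropped for the same reason: only "Present" is in the whitelist.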

# Oncological Assertion Models

<div align="center">

|    | model_name              |Predicted Entities|
|---:|:------------------------|-|
| 1 | [assertion_oncology](https://nlp.johnsnowlabs.com/2024/07/03/assertion_oncology_en.html) | Absent, Family, Hypothetical, Past, Possible, Present|
| 2 | [assertion_oncology_problem](https://nlp.johnsnowlabs.com/2024/07/03/assertion_oncology_problem_en.html) |Family_History, Hypothetical_Or_Absent, Medical_History, Possible|
| 3 | [assertion_oncology_treatment_binary](https://nlp.johnsnowlabs.com/2024/07/03/assertion_oncology_treatment_binary_en.html) |Hypothetical_Or_Absent, Present_Or_Past|
| 4 | [assertion_oncology_response_to_treatment](https://nlp.johnsnowlabs.com/2024/07/03/assertion_oncology_response_to_treatment_en.html) |Hypothetical_Or_Absent, Present_Or_Past|
| 5 | [assertion_oncology_test_binary](https://nlp.johnsnowlabs.com/2024/07/03/assertion_oncology_test_binary_en.html) |Hypothetical_Or_Absent, Medical_History|
| 6 | [assertion_oncology_smoking_status](https://nlp.johnsnowlabs.com/2024/07/03/assertion_oncology_smoking_status_en.html) |Absent, Past, Present|
| 7 | [assertion_oncology_family_history](https://nlp.johnsnowlabs.com/2024/07/03/assertion_oncology_family_history_en.html) |Family_History, Other|
| 8 | [assertion_oncology_demographic_binary](https://nlp.johnsnowlabs.com/2024/07/03/assertion_oncology_demographic_binary_en.html) |Patient, Someone_Else|

</div>

In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_oncology_wip'

assertion_model_name='assertion_oncology_wip'

nrows = 100

ner_df = get_clinical_assertion (embeddings, mt_samples_df, nrows, ner_model_name, assertion_model_name )

ner_df.show(truncate = False)

ner_oncology_wip download started this may take some time.
Approximate size to download 963.8 KB
[OK!]
assertion_oncology_wip download started this may take some time.
Approximate size to download 1.4 MB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+------------------------+--------------------+------------+----------+
|chunk                   |ner_label           |assertion   |confidence|
+------------------------+--------------------+------------+----------+
|Mesothelioma            |Cancer_Dx           |Present     |0.9885    |
|Mesothelioma            |Cancer_Dx           |Hypothetical|0.981     |
|August 24, 2007         |Date                |Past        |0.9726    |
|decortication           |Cancer_Surgery      |Past        |0.994     |
|lung                    |Site_Lung           |Past        |0.9453    |
|pleural                 |Site_Other_Body_Part|Past        |0.9624    |
|biopsy                  |Pathology_Te

# Voice of Patient Assertion Models

<div align="center">

|    | model_name              |Predicted Entities|
|---:|:------------------------|-|
| 1        | [assertion_vop_clinical](https://nlp.johnsnowlabs.com/2023/08/17/assertion_vop_clinical_en.html)     | Hypothetical_Or_Absent, Present_Or_Past, SomeoneElse |
| 2          | [assertion_vop_clinical_medium](https://nlp.johnsnowlabs.com/2023/08/17/assertion_vop_clinical_medium_en.html)       | Hypothetical_Or_Absent, Present_Or_Past, SomeoneElse |
| 3          | [assertion_vop_clinical_large](https://nlp.johnsnowlabs.com/2023/08/17/assertion_vop_clinical_large_en.html)       | Hypothetical_Or_Absent, Present_Or_Past, SomeoneElse |



</div>

This [assertion status model](https://nlp.johnsnowlabs.com/2023/08/17/assertion_vop_clinical_en.html) predicts whether an NER chunk refers to a positive finding from the patient (Present_Or_Past), to a family member or another person (SomeoneElse), or to something that is mentioned but not present (Hypothetical_Or_Absent).

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = medical.NerModel.pretrained("ner_vop", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setBlackList(['DATETIME',  'GENDER', 'AGE', 'SUBSTANCEQUANTITY','FORM', 'ADMISSIONDISCHARGE', 'TESTRESULT', 'TEST',
                  'MEDICALDEVICE','CLINICALDEPT','DRUG', 'ROUTE', 'DURATION',"DOSAGE",'FREQUENCY', 'BODYPART',
                   ])

assertion = medical.AssertionDLModel.pretrained("assertion_vop_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner,
        ner_converter,
        assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

vop_pipeline_model = pipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_vop download started this may take some time.
Approximate size to download 3.7 MB
[OK!]
assertion_vop_clinical download started this may take some time.
Approximate size to download 919.9 KB
[OK!]


In [None]:
assertion.getClasses()


['Hypothetical_Or_Absent', 'Present_Or_Past', 'SomeoneElse']

In [None]:
sample_text = '''Hello, I am a 20-year-old woman who was diagnosed with hyperthyroidism around a month ago. For approximately four months, I've been experiencing symptoms such as feeling light-headed, battling poor digestion, dealing with anxiety attacks, depression, a sharp pain on my left side chest, an elevated heart rate, and a significant loss of weight. Due to these conditions, I was admitted to the hospital and just got discharged recently. During my hospital stay, a number of different tests were carried out by various physicians who initially struggled to pinpoint my actual medical condition. These tests included numerous blood tests, a brain MRI, an ultrasound scan, and an endoscopy. At long last, I was examined by a homeopathic doctor who finally diagnosed me with hyperthyroidism, indicating my TSH level was at a low 0.15 while my T3 and T4 levels were normal. Additionally, I was found to be deficient in vitamins B12 and D. Hence, I've been on a regimen of vitamin D supplements once a week and a daily dose of 1000 mcg of vitamin B12. I've been undergoing homeopathic treatment for the last 40 days and underwent a second test after a month which showed my TSH level increased to 0.5. While I'm noticing a slight improvement in my feelings of weakness and depression, over the last week, I've encountered two new challenges: difficulty breathing and a dramatically increased heart rate. I'm now at a crossroads where I am unsure if I should switch to allopathic treatment or continue with homeopathy. I understand that thyroid conditions take a while to improve, but I'm wondering if both treatments would require the same duration for recovery. Several of my acquaintances have recommended transitioning to allopathy and warn against taking risks, given the potential of developing severe complications. Please forgive any errors in my English and thank you for your understanding.'''

light_model = nlp.LightPipeline(vop_pipeline_model)

light_result = light_model.fullAnnotate(sample_text)

vis = nlp.viz.AssertionVisualizer()

vis.display(light_result[0], 'ner_chunk', 'assertion')


# Social Determinants of Health Assertion Models

<div align="center">

|    | model_name              |Predicted Entities|
|---------------|----------------------|---|
| 1        | [assertion_sdoh_wip](https://nlp.johnsnowlabs.com/2023/08/13/assertion_sdoh_wip_en.html)     | `Present`, `Absent`, `Someone_Else`, `Past`, `Hypothetical`, `Possible` |


</div>


In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_sdoh", "en", "clinical/models")\
    .setInputCols(["sentence", "token","embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(['sentence', 'token', 'ner'])\
    .setOutputCol('ner_chunk')\
    .setBlackList(['Age','Gender','Language','Healthcare_Institution'])   # assertion status is not needed for these entities

assertion = medical.AssertionDLModel.pretrained("assertion_sdoh_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        clinical_embeddings,
        ner_model,
        ner_converter,
        assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_sdoh download started this may take some time.
Approximate size to download 2.8 MB
[OK!]
assertion_sdoh_wip download started this may take some time.
Approximate size to download 10.3 MB
[OK!]


In [None]:
assertion.getClasses()

['Absent', 'Present', 'Someone_Else', 'Past', 'Hypothetical', 'Possible']

In [None]:
sample_text= [
"""Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave.
But she has generally housing problems. She lives in a apartment now.  She has long history of EtOH abuse, beginning in her teens.
She is aware she needs to attend Rehab Programs. She had DUI back in April and was due to be in court this week.
Her partner is an alcoholic and a drug abuser for the last 5 years.
She also mentioned feeling socially isolated and lack of a strong support system.
"""
]

light_model = nlp.LightPipeline(model)
light_result = light_model.fullAnnotate(sample_text)

light_result[0].keys()

dict_keys(['document', 'ner_chunk', 'assertion', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
pipeline_tracer = medical.PipelineTracer(model)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "assertion_sdoh"})

pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(light_result)

In [None]:
assertions_df = pd.DataFrame(result['result'][0]['assertions'])
entities_df = pd.DataFrame(result['result'][0]['entities'])

merged_df = pd.merge(entities_df, assertions_df,  on=['chunk_id', 'chunk']).drop(columns='chunk_id')

merged_df

Unnamed: 0,chunk,begin,end,ner_label,ner_source,ner_confidence,assertion,assertion_confidence,assertion_source
0,cleaning assistant,17,34,Employment,ner_chunk,0.76975,Present,0.7926,assertion
1,health insurance,64,79,Insurance_Status,ner_chunk,0.6325,Absent,0.5072,assertion
2,apartment,156,164,Housing,ner_chunk,0.9575,Present,0.9956,assertion
3,EtOH abuse,196,205,Alcohol,ner_chunk,0.8286,Past,0.6054,assertion
4,Rehab Programs,265,278,Access_To_Care,ner_chunk,0.6292,Hypothetical,0.5861,assertion
5,DUI,289,291,Legal_Issues,ner_chunk,0.9603,Past,0.5037,assertion
6,alcoholic,363,371,Alcohol,ner_chunk,0.997,Someone_Else,0.9868,assertion
7,drug abuser,379,389,Substance_Use,ner_chunk,0.89475,Someone_Else,0.9996,assertion
8,last 5 years,399,410,Substance_Duration,ner_chunk,0.5945,Someone_Else,0.9951,assertion
9,socially isolated,440,456,Social_Exclusion,ner_chunk,0.64390004,Present,0.9673,assertion


In [None]:
vis = nlp.viz.AssertionVisualizer()

vis.display(light_result[0], 'ner_chunk', 'assertion')

# AssertionChunkConverter

When the chunk column is built from token indices, errors in those indices can cause data to be lost while training and testing the assertion status model. The `AssertionChunkConverter` annotator avoids this by taking the **begin and end indices of the chunks** (character offsets, as in the example below) as input and creating an extended chunk column with metadata that can be used for assertion status detection model training.

*NOTE*: Chunk begin and end indices in the assertion status model training dataframe can be populated using the new version of the ALAB module.

In [None]:
data = spark.createDataFrame([
    ["An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.", "Minnie", 57, 63],
    ["After discussing this with his PCP , Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation ", "PCP", 31, 34]])\
    .toDF("text", "target", "char_begin", "char_end")

data.show()

+--------------------+------+----------+--------+
|                text|target|char_begin|char_end|
+--------------------+------+----------+--------+
|An angiography sh...|Minnie|        57|      63|
|After discussing ...|   PCP|        31|      34|
+--------------------+------+----------+--------+



In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("tokens")

converter = medical.AssertionChunkConverter() \
    .setInputCols("tokens")\
    .setChunkTextCol("target")\
    .setChunkBeginCol("char_begin")\
    .setChunkEndCol("char_end")\
    .setOutputTokenBeginCol("token_begin")\
    .setOutputTokenEndCol("token_end")\
    .setOutputCol("chunk")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentenceDetector,
    tokenizer,
    converter
])

results = pipeline.fit(data).transform(data)

In [None]:
results\
    .selectExpr(
        "target",
        "char_begin",
        "char_end",
        "token_begin",
        "token_end",
        "tokens[token_begin].result",
        "tokens[token_end].result",
        "target",
        "chunk")\
    .show(truncate=False)

+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk                                         |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57        |63      |10         |10       |Minnie                    |Minnie                  |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP   |31        |34      |5          |5        |PCP                       |PCP                     |PCP   |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]   |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
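
The conversion from character offsets to token indices shown in the table above can be sketched in plain Python. This is a rough illustration assuming simple whitespace tokenization, which is not the same as the Spark NLP `Tokenizer`; `char_span_to_token_span` is a hypothetical helper, and the second sentence is shortened for readability.

```python
import re


def char_span_to_token_span(text, char_begin, char_end):
    """Map a [char_begin, char_end) character span to (first, last) token indices.

    Tokens are whitespace-separated here; returns None if no token overlaps the span.
    """
    token_spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    hits = [
        index
        for index, (start, end) in enumerate(token_spans)
        if start < char_end and end > char_begin
    ]
    if not hits:
        return None
    return hits[0], hits[-1]


text_1 = ("An angiography showed bleeding in two vessels off of the Minnie "
          "supplying the sigmoid that were succesfully embolized.")
text_2 = "After discussing this with his PCP , Leon was clear that the patient had had recurrent DVTs"

span_1 = char_span_to_token_span(text_1, 57, 63)
span_2 = char_span_to_token_span(text_2, 31, 34)
print(span_1, span_2)
```

For these two rows, whitespace tokens happen to line up with the `token_begin` and `token_end` values in the table because no punctuation precedes the targets; on other text a real tokenizer may produce different indices.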



# AssertionMerger

AssertionMerger merges a variety of assertion columns produced by assertion annotators such as AssertionDL and AssertionLogReg.
It can filter, prioritize, and merge assertion annotations according to the parameters below.

**Parameters:**

- `mergeOverlapping`: Whether to merge overlapping matched assertion annotations. Default: `True`
- `applyFilterBeforeMerge`: Whether to apply filtering before merging process. If `True`, filtering will be applied before merging; if `False`, filtering will be applied after merging process. Default: `False`.
- `blackList`: If defined, list of entities to ignore. The rest will be processed.
- `whiteList`: If defined, list of entities to process. The rest will be ignored. Do not include IOB prefix on labels.
- `caseSensitive`: Determines whether the definitions of the white listed and black listed entities are case sensitive. Default: `True`.
- `assertionsConfidence`: Pairs (assertion,confidenceThreshold) to filter assertions which have confidence lower than the confidence threshold.
- `orderingFeatures`: Specifies the ordering features to use for overlapping entities. Possible values include: 'begin', 'end', 'length', 'source', 'confidence'. Default: `['begin', 'length', 'source']`
- `selectionStrategy`: Determines the strategy for selecting annotations. Annotations can be selected either sequentially based on their order (Sequential) or using a more diverse strategy (DiverseLonger). Currently, only Sequential and DiverseLonger options are available. Default: `Sequential`.
- `defaultConfidence`: When the confidence value is included in the orderingFeatures and a given annotation does not have any confidence, this parameter determines the value to be used. Default: `0`.
- `assertionSourcePrecedence`: Specifies the assertion sources to use for prioritizing overlapping annotations when the 'source' ordering feature is utilized. This parameter contains a comma-separated list of assertion sources that drive the prioritization. Annotations will be prioritized based on the order of the given string.
- `sortByBegin`: Whether to sort the annotations by begin at the end of the merge and filter process. Default: `False`.
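
To make the interplay of `assertionsConfidence`, `orderingFeatures`, and `assertionSourcePrecedence` more concrete, here is a minimal pure-Python sketch. It is not the AssertionMerger implementation: the dictionary layout, the `merge_assertions` helper, the toy annotations, and the exact ranking (longer span first, then source precedence, then higher confidence) are assumptions made for illustration only.

```python
# Illustrative sketch only; NOT the AssertionMerger implementation.
def merge_assertions(annotations, thresholds, source_precedence, case_sensitive=False):
    """Filter by per-label confidence, then keep one annotation per overlapping group."""
    def norm(label):
        return label if case_sensitive else label.lower()

    limits = {norm(k): v for k, v in thresholds.items()}
    order = {source: i for i, source in enumerate(source_precedence)}

    # 1) Drop annotations whose confidence is below the threshold for their label.
    kept = [a for a in annotations
            if a["confidence"] >= limits.get(norm(a["assertion"]), 0.0)]

    # 2) Rank candidates (assumed ordering features: length, source, confidence).
    def rank(a):
        return (-(a["end"] - a["begin"]),
                order.get(a["source"], len(order)),
                -a["confidence"])

    # 3) Walk the ranked list and skip anything overlapping an already chosen span.
    selected = []
    for cand in sorted(kept, key=rank):
        overlaps = any(cand["begin"] <= s["end"] and s["begin"] <= cand["end"]
                       for s in selected)
        if not overlaps:
            selected.append(cand)
    return sorted(selected, key=lambda a: a["begin"])


# Hypothetical annotations, not taken from the notebook output.
candidates = [
    {"chunk": "chest pain", "begin": 8, "end": 17, "assertion": "present",
     "source": "assertion_dl", "confidence": 0.95},
    {"chunk": "severe chest pain", "begin": 1, "end": 17, "assertion": "Past",
     "source": "assertion_jsl", "confidence": 0.90},
    {"chunk": "fever", "begin": 30, "end": 34, "assertion": "Past",
     "source": "assertion_jsl", "confidence": 0.60},
    {"chunk": "cough", "begin": 40, "end": 44, "assertion": "absent",
     "source": "assertion_dl", "confidence": 0.55},
]
merged = merge_assertions(candidates, {"past": 0.70},
                          ["assertion_dl", "assertion_jsl"])
print([a["chunk"] for a in merged])
```

In this toy run, "fever" is dropped by the `past` threshold, and the longer "severe chest pain" wins over the overlapping "chest pain" because length is ranked first.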




In [None]:
from pyspark.sql.types import StringType

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
ner_jsl = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_jsl")\
    #.setIncludeAllConfidenceScores(False)

ner_jsl_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_jsl"]) \
    .setOutputCol("ner_jsl_chunk")\
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])

# Assertion model trained on i2b2 (sampled from MIMIC) dataset
assertion_jsl = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_jsl_chunk", "embeddings"]) \
    .setOutputCol("assertion_jsl")\
    .setEntityAssertionCaseSensitive(False)

ner_clinical = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_clinical")\
    #.setIncludeAllConfidenceScores(False)

ner_clinical_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_clinical"]) \
    .setOutputCol("ner_clinical_chunk")

# Assertion model trained on radiology dataset
assertion_dl = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_clinical_chunk", "embeddings"]) \
    .setOutputCol("assertion_dl")

from sparknlp_jsl.annotator import AssertionMerger
assertion_merger = AssertionMerger() \
    .setInputCols("assertion_jsl", "assertion_dl") \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("sequential") \
    .setAssertionSourcePrecedence("assertion_dl, assertion_jsl") \
    .setCaseSensitive(False) \
    .setAssertionsConfidence({"past": 0.70}) \
    .setOrderingFeatures(["length", "source", "confidence"]) \
    .setDefaultConfidence(0.50)\
    #.setBlackList(["Hypothetical"])

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_jsl,
        ner_jsl_converter,
        assertion_jsl,
        ner_clinical,
        ner_clinical_converter,
        assertion_dl,
        assertion_merger
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
assertion_jsl_augmented download started this may take some time.
Approximate size to download 6.2 MB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]


In [None]:
data = spark.createDataFrame([
    """Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She got antidepressant. We prescribed sleeping pills for her current insomnia."""
], StringType()).toDF("text")


data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())
results = pipeline.fit(data).transform(data)

In [None]:
results.select("idx",F.explode(F.arrays_zip(results.assertion_merger.metadata,
                                            results.assertion_merger.begin,
                                            results.assertion_merger.end,
                                            results.assertion_merger.result)).alias("cols")) \
        .select("idx",F.expr("cols['0']['ner_chunk']").alias("ner_chunk"),
                F.expr("cols['1']").alias("begin"),
                F.expr("cols['2']").alias("end"),
                F.expr("cols['0']['ner_label']").alias("ner_label"),
                F.expr("cols['3']").alias("assertion"),
                F.expr("cols['0']['assertion_source']").alias("assertion_source"),
                F.expr("cols['0']['confidence']").alias("confidence"),
                ).sort("idx","begin").show(truncate=False)

+---+--------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk     |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------+-----+---+---------+---------+----------------+----------+
|0  |headache      |14   |21 |Symptom  |Past     |assertion_jsl   |0.9999    |
|0  |anxious       |57   |63 |PROBLEM  |present  |assertion_dl    |0.9392    |
|0  |alopecia      |89   |96 |PROBLEM  |absent   |assertion_dl    |0.9992    |
|0  |pain          |116  |119|PROBLEM  |absent   |assertion_dl    |0.9884    |
|0  |paralyzed     |136  |144|Symptom  |Family   |assertion_jsl   |0.9995    |
|0  |stressor      |158  |165|Symptom  |Family   |assertion_jsl   |1.0       |
|0  |antidepressant|184  |197|TREATMENT|present  |assertion_dl    |0.9628    |
|0  |sleeping pills|214  |227|TREATMENT|present  |assertion_dl    |0.998     |
|0  |insomnia      |245  |252|Symptom  |Past     |assertion_jsl   |0.9862    |
+---+--------------+-----+---+---------+---------+----------------+----------+

# Train a custom Assertion Model

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/i2b2_assertion_sample_short.csv

In [None]:
assertion_df = spark.read.option("header", True).option("inferSchema", "True").csv("i2b2_assertion_sample_short.csv")

In [None]:
assertion_df.show(3, truncate=100)

+-------------------------------------------------+-------------------+-------+-----+---+
|                                             text|             target|  label|start|end|
+-------------------------------------------------+-------------------+-------+-----+---+
|She has no history of liver disease , hepatitis .|      liver disease| absent|    5|  6|
|                         1. Undesired fertility .|undesired fertility|present|    1|  2|
|                            3) STATUS POST FALL .|               fall|present|    3|  3|
+-------------------------------------------------+-------------------+-------+-----+---+
only showing top 3 rows



In [None]:
(training_data, test_data) = assertion_df.randomSplit([0.8, 0.2], seed = 100)
print("Training Dataset Count: " + str(training_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 721
Test Dataset Count: 170


In [None]:
training_data.groupBy('label').count().orderBy('count', ascending=False).show(truncate=False)

+-------+-----+
|label  |count|
+-------+-----+
|present|546  |
|absent |175  |
+-------+-----+



In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

chunk2doc = medical.Doc2ChunkInternal()\
    .setInputCols(["document","token"])\
    .setOutputCol("chunk")\
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(True)

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

clinical_assertion_pipeline = nlp.Pipeline(
    stages = [
        document,
        token,
        chunk2doc,
        embeddings
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


We transform the test data with a pipeline that has the same preprocessing stages as the training pipeline that contains AssertionDLApproach.
This gives the test data the same columns that AssertionDLApproach sees in the training data. <br/>
The transformed test data can then be passed to AssertionDLApproach through the `setTestDataset()` parameter.

In [None]:
assertion_test_data = clinical_assertion_pipeline.fit(test_data).transform(test_data)

In [None]:
assertion_test_data.columns

['text',
 'target',
 'label',
 'start',
 'end',
 'document',
 'token',
 'chunk',
 'embeddings']
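
Since `setTestDataset()` expects already transformed data, a quick sanity check on the column list can catch a missing stage early. The helper below is a hypothetical sketch (the name `missing_columns` and the required-column tuple are assumptions derived from the input, label, start, and end columns configured for AssertionDLApproach in this notebook).

```python
def missing_columns(available,
                    required=("document", "chunk", "embeddings", "label", "start", "end")):
    """Return the required column names that are absent from `available`."""
    return [name for name in required if name not in available]

# Column list as printed by assertion_test_data.columns above.
columns = ['text', 'target', 'label', 'start', 'end',
           'document', 'token', 'chunk', 'embeddings']
missing = missing_columns(columns)
print(missing)
```

An empty list means every column the training stage reads is present in the test dataframe.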

We save the test data in parquet format to use in `AssertionDLApproach()`.

In [None]:
assertion_test_data.write.mode('overwrite').parquet('i2b2_assertion_sample_test_data.parquet')

## Graph setup

In [None]:
# Install tensorflow-addons
!pip install git+https://github.com/tensorflow/addons.git

We will use the TFGraphBuilder annotator, which creates the TensorFlow graph as part of the model training pipeline.

TFGraphBuilder inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.
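
The version constraint can be checked before running the training cell. The helper below is a small sketch with a hypothetical name (`tf_version_supported`); it only compares version strings and does not import TensorFlow itself.

```python
def tf_version_supported(version, max_supported=(2, 7)):
    """Return True if a 'major.minor[.patch]' version string is <= max_supported."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) <= max_supported

for candidate in ("1.15.0", "2.7.4", "2.12.0"):
    print(candidate, tf_version_supported(candidate))
```

In a notebook, the installed version could be passed in as `tf_version_supported(tf.__version__)` after `import tensorflow as tf`.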

In [None]:
graph_folder= "./tf_graphs"

assertion_graph_builder = medical.TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(250)\
    .setHiddenUnitsNumber(25)

In [None]:
'''
# ready to use tf_graph

!mkdir training_logs
!mkdir assertion_tf_graph

!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/tf_graphs/blstm_34_32_30_200_2.pb -P /content/assertion_tf_graph
'''

'''
# create custom graph

medical.tf_graph.print_model_params("assertion_dl")

feat_size = 200
n_classes = 6

medical.tf_graph.build("assertion_dl",
                        build_params={"n_classes": n_classes},
                        model_location= "./tf_graphs",
                        model_filename="blstm_34_32_30_{}_{}.pb".format(feat_size, n_classes))
'''

**Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models**


The `setScopeWindow()` parameter lets you train Assertion Status models to focus on a specific context window when resolving the status of a NER chunk. The window has the format `[X,Y]`, where `X` is the number of tokens to consider to the left of the chunk and `Y` is the maximum number of tokens to consider to the right. Let’s take a look at what different windows mean:


*   By default, the window is `[-1,-1]`, which means that the Assertion Status model looks at all of the tokens in the sentence/document (up to the maximum number of tokens set in `setMaxSentLen()`).
*   `[0,0]` means “don’t pay attention to any token except the ner_chunk”, which effectively means that no context is considered when resolving the assertion.
*   `[9,15]` is what empirically seems to be the best baseline, meaning that we look at up to 9 tokens on the left and 15 on the right of the ner chunk to understand the context and resolve the status.


See the [Scope Window Tuning Assertion Status Detection notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Healthcare/2.1.Scope_window_tuning_assertion_status_detection.ipynb), which illustrates the effect of different windows and how to fine-tune your AssertionDLModels to get the best out of them.

In our case, the best scope window is around `[10,10]`.
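
The window semantics described above can be sketched in a few lines of plain Python. This is only an illustration of which tokens a given window exposes, not how AssertionDLApproach applies the window internally. The `apply_scope_window` helper is hypothetical, whitespace tokenization is assumed, and the `setMaxSentLen()` cap is not modeled.

```python
def apply_scope_window(tokens, chunk_start, chunk_end, window):
    """Return the tokens visible for a chunk at inclusive token indices
    [chunk_start, chunk_end]; a negative window side means 'no limit'."""
    left, right = window
    lo = 0 if left < 0 else max(0, chunk_start - left)
    hi = len(tokens) if right < 0 else min(len(tokens), chunk_end + 1 + right)
    return tokens[lo:hi]

# First row of the training sample: "liver disease" is at token indices 5..6.
tokens = "She has no history of liver disease , hepatitis .".split()

print(apply_scope_window(tokens, 5, 6, [-1, -1]))  # whole sentence
print(apply_scope_window(tokens, 5, 6, [0, 0]))    # chunk only
print(apply_scope_window(tokens, 5, 6, [2, 1]))    # 2 tokens left, 1 token right
```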

In [None]:
scope_window = [10,10]

assertionStatus = medical.AssertionDLApproach()\
    .setLabelCol("label")\
    .setInputCols("document", "chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setBatchSize(64)\
    .setDropout(0.1)\
    .setLearningRate(0.001)\
    .setEpochs(20)\
    .setValidationSplit(0.2)\
    .setStartCol("start")\
    .setEndCol("end")\
    .setMaxSentLen(250)\
    .setIncludeConfidence(True)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="/content/i2b2_assertion_sample_test_data.parquet")\
    .setScopeWindow(scope_window)

'''
If .setTestDataset is used, raw test data cannot be passed to it directly. .setTestDataset only works with dataframes
that have already been transformed by a pipeline consisting of document, chunk, and embeddings stages.
'''

In [None]:
'''
assertionStatus = medical.AssertionLogRegApproach()\
    .setLabelCol("label")\
    .setInputCols("document", "chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setMaxIter(100) # default: 26
'''

In [None]:
clinical_assertion_pipeline = nlp.Pipeline(
    stages = [
        document,
        token,
        chunk2doc,
        embeddings,
        assertion_graph_builder,
        assertionStatus
])

In [None]:
%%time

assertion_model = clinical_assertion_pipeline.fit(training_data)

TF Graph Builder configuration:
Model name: assertion_dl
Graph folder: ./tf_graphs
Graph file name: assertion_graph.pb
Build params: {'n_classes': 2, 'feat_size': 200, 'max_seq_len': 250, 'n_hidden': 25}


Instructions for updating:
non-resource variables are not supported in the long term


Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0



Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0

assertion_dl graph exported to ./tf_graphs/assertion_graph.pb
CPU times: user 8.26 s, sys: 621 ms, total: 8.88 s
Wall time: 1min 41s


## Checking the results

Checking the results saved in the log file

In [None]:
import os
log_files = os.listdir("./training_logs")
log_files

['AssertionDLApproach_8d3d56cc1a8a.log']

In [None]:
with open("./training_logs/"+log_files[0]) as log_file:
    print(log_file.read())

Name of the selected graph: ./tf_graphs/assertion_graph.pb
Training started, trainExamples: 721


Epoch: 0 started, learning rate: 0.001, dataset size: 577
Done, 4.123974686 total training loss: 5.368767, avg training loss: 0.5368767, batches: 10
Quality on validation dataset (20.0%), validation examples = 144
time to finish evaluation: 0.84s
Total validation loss: 1.6325	Avg validation loss: 0.5442
label	 tp	 fp	 fn	 prec	 rec	 f1
present	 108	 36	 0	 0.75	 1.0	 0.85714287
absent	 0	 0	 36	 0.0	 0.0	 0.0
tp: 108 fp: 36 fn: 36 labels: 2
Macro-average	 prec: 0.375, rec: 0.5, f1: 0.42857143
Micro-average	 prec: 0.75, rec: 0.75, f1: 0.75


Quality on test dataset: 
time to finish evaluation: 0.70s
Total test loss: 1.8458	Avg test loss: 0.6153
label	 tp	 fp	 fn	 prec	 rec	 f1
present	 117	 53	 0	 0.6882353	 1.0	 0.815331
absent	 0	 0	 53	 0.0	 0.0	 0.0
tp: 117 fp: 53 fn: 53 labels: 2
Macro-average	 prec: 0.34411764, rec: 0.5, f1: 0.4076655
Micro-average	 prec: 0.6882353, rec: 0.6882353, f1
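
The macro and micro averages in the log can be reproduced from the per-label `tp`/`fp`/`fn` counts. The sketch below uses the epoch 0 validation counts shown above and assumes that the macro F1 is the mean of the per-label F1 scores, which matches the logged values; the helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Validation counts from epoch 0 of the log: label -> (tp, fp, fn)
counts = {"present": (108, 36, 0), "absent": (0, 0, 36)}

# Macro average: average the per-label metrics.
per_label = {label: prf(*c) for label, c in counts.items()}
macro = tuple(sum(m[i] for m in per_label.values()) / len(per_label) for i in range(3))

# Micro average: sum the counts first, then compute the metrics once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = prf(tp, fp, fn)

print("Macro-average prec: %.3f, rec: %.3f, f1: %.8f" % macro)
print("Micro-average prec: %.2f, rec: %.2f, f1: %.2f" % micro)
```

The gap between the two averages reflects the class imbalance: at this epoch the model predicts `present` for every example, so the `absent` label contributes zeros to the macro average.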

In [None]:
preds = assertion_model.transform(test_data).selectExpr('label','assertion.result[0] as result').toPandas()
preds.head()

Unnamed: 0,label,result
0,present,present
1,absent,present
2,present,present
3,present,present
4,present,present


In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds['label'], preds['result']))

              precision    recall  f1-score   support

      absent       1.00      0.43      0.61        53
     present       0.80      1.00      0.89       117

    accuracy                           0.82       170
   macro avg       0.90      0.72      0.75       170
weighted avg       0.86      0.82      0.80       170



In [None]:
# save model
assertion_model.stages[-1].write().overwrite().save('assertion_custom_model')

## Load saved model

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


In [None]:
clinical_assertion = medical.AssertionDLModel.load("assertion_custom_model") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)


In [None]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted'

light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)

In [None]:
pipeline_tracer = medical.PipelineTracer(model)

column_maps = pipeline_tracer.createParserDictionary()
column_maps.update({"document_identifier": "custom_model"})

pipeline_parser = medical.PipelineOutputParser(column_maps)
result = pipeline_parser.run(light_result)

In [None]:
assertions_df = pd.DataFrame(result['result'][0]['assertions'])
entities_df = pd.DataFrame(result['result'][0]['entities'])

merged_df = pd.merge(entities_df, assertions_df,  on=['chunk_id', 'chunk']).drop(columns='chunk_id')

merged_df

Unnamed: 0,chunk,begin,end,ner_label,ner_source,ner_confidence,assertion,assertion_confidence,assertion_source
0,a headache,12,21,PROBLEM,ner_chunk,0.97150004,present,0.9939,assertion
1,a head CT,58,66,TEST,ner_chunk,0.8149,present,0.999,assertion
2,anxious,81,87,PROBLEM,ner_chunk,0.9769,present,0.9926,assertion
3,alopecia,113,120,PROBLEM,ner_chunk,0.9994,absent,0.5614,assertion
4,pain,126,129,PROBLEM,ner_chunk,0.9993,absent,0.5807,assertion


## Extra Information

**ExceptionHandling**

Robust exception handling for cases where processing breaks down due to corrupted inputs. If it is set to `True`, the annotator processes records as usual, and if exception-causing data (e.g. a corrupted record or document) is passed to the annotator, a warning containing the exception message is emitted. Processing then continues with the next record, and the rest of the records in the same batch are parsed without interruption. This comes with a performance penalty. The default is `False`, in which case the annotator throws the exception and stops the process to inform the user.


*Example*:
```python
clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")\
    .setEntityAssertionCaseSensitive(False)\
    .setDoExceptionHandling(True)
```
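
The behaviour described above (warn and continue instead of failing the whole batch) follows a common pattern, sketched here in plain Python. This is not the annotator's internal code: `annotate_batch`, `toy_annotate`, and the sample records are hypothetical names and data used only to illustrate the two modes.

```python
import warnings

def annotate_batch(records, annotate, do_exception_handling=False):
    """Apply `annotate` to each record; optionally warn and continue on failure."""
    results = []
    for record in records:
        try:
            results.append(annotate(record))
        except Exception as exc:
            if not do_exception_handling:
                raise  # default mode: stop the whole process
            warnings.warn(f"Skipping record {record!r}: {exc}")
            results.append(None)  # placeholder for the failed record
    return results

def toy_annotate(text):
    """Stand-in for an annotator; fails on non-string (corrupted) input."""
    if not isinstance(text, str):
        raise ValueError("corrupted record")
    return text.upper()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = annotate_batch(["fever", None, "cough"], toy_annotate,
                         do_exception_handling=True)

print(out)
print(len(caught), "warning(s) emitted")
```

With `do_exception_handling=False`, the same input raises on the corrupted record and nothing after it is processed.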