![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Training and Reusing Assertion Status Models

In [0]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install the licensed Python wheels and pre-download the JARs for the Spark session JVM
#nlp.install()

In [0]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL

import os
import json
import string
import numpy as np
import pandas as pd

from pyspark.ml import Pipeline, PipelineModel

pd.set_option('max_colwidth', 100)
pd.set_option('display.max_columns', 100)  
pd.set_option('display.expand_frame_repr', False)

spark

# Clinical Assertion Model (with pretrained models)

The deep neural network architecture for assertion status detection in Spark NLP is based on a Bi-LSTM framework and is a modified version of the architecture proposed by Federico Fancellu, Adam Lopez and Bonnie Webber ([Neural Networks For Negation Scope Detection](https://aclanthology.org/P16-1047.pdf)). Its goal is to classify the assertion made on a given medical concept as one of six statuses: present in the patient, absent, possible, conditionally present under certain circumstances, hypothetically present at some future point, or mentioned in the patient report but associated with someone else.

In this implementation, the input units depend on the target tokens (a named entity) and the neighboring words, which are explicitly encoded as a sequence using word embeddings. As in the paper cited above, 95% of the scope tokens (neighboring words) were observed to fall within a window of 9 tokens to the left and 15 tokens to the right of the target tokens in the same dataset. The same window size was therefore used, together with the following parameters: learning rate 0.0012, dropout 0.05, batch size 64 and maximum sentence length 250.

The model is implemented in Spark NLP as an annotator called `AssertionDLModel`. After training for 20 epochs and measuring accuracy on the official test set, this implementation exceeds the latest state-of-the-art benchmarks on four of the six labels and on micro F1, as summarized in the following table:

|Assertion Label|Spark NLP|Latest Best|
|-|-|-|
|Absent       |0.944 |0.937|
|Someone-else |0.904|0.869|
|Conditional  |0.441|0.422|
|Hypothetical |0.862|0.890|
|Possible     |0.680|0.630|
|Present      |0.953|0.957|
|micro F1     |0.939|0.934|
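
The 9-left / 15-right scope window described above can be illustrated with a few lines of plain Python. This is only a sketch of the windowing idea, not the internal implementation of `AssertionDLModel`; the function name `scope_window` and the sample sentence are hypothetical.

```python
from typing import List, Tuple


def scope_window(tokens: List[str], start: int, end: int,
                 left: int = 9, right: int = 15) -> Tuple[List[str], List[str], List[str]]:
    """Split a sentence into left context, target tokens and right context.

    `start` and `end` are inclusive token indices of the target entity.
    At most `left` tokens before and `right` tokens after the target are kept.
    """
    if not (0 <= start <= end < len(tokens)):
        raise ValueError("target span must lie inside the sentence")
    left_ctx = tokens[max(0, start - left):start]
    target = tokens[start:end + 1]
    right_ctx = tokens[end + 1:end + 1 + right]
    return left_ctx, target, right_ctx


# Hypothetical sentence; the target entity "chest pain" sits at token indices 4..5
tokens = "The patient denies any chest pain but reports mild nausea".split()
left_ctx, target, right_ctx = scope_window(tokens, 4, 5)
print(left_ctx, target, right_ctx)
```

In a short sentence like this one the whole sentence fits inside the window; the limits only take effect on long sentences.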

|index|model|
|-----:|:-----|
| 1| [assertion_dl](https://nlp.johnsnowlabs.com/2021/01/26/assertion_dl_en.html)  |
| 2| [assertion_dl_biobert_scope_L10R10](https://nlp.johnsnowlabs.com/2022/03/24/assertion_dl_biobert_scope_L10R10_en_2_4.html)  |
| 3| [assertion_dl_en](https://nlp.johnsnowlabs.com/2020/01/30/assertion_dl_en.html)  |
| 4| [assertion_dl_healthcare](https://nlp.johnsnowlabs.com/2020/09/23/assertion_dl_healthcare_en.html)  |
| 5| [assertion_dl_large_en](https://nlp.johnsnowlabs.com/2020/05/21/assertion_dl_large_en.html)  |
| 6| [assertion_dl_radiology](https://nlp.johnsnowlabs.com/2021/03/18/assertion_dl_radiology_en.html)  |
| 7| [assertion_dl_scope_L10R10](https://nlp.johnsnowlabs.com/2022/03/17/assertion_dl_scope_L10R10_en_3_0.html)  |
| 8| [assertion_jsl](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_en.html)  |
| 9| [assertion_jsl_augmented](https://nlp.johnsnowlabs.com/2022/09/15/assertion_jsl_augmented_en.html)  |
| 10| [assertion_jsl_large](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_large_en.html)  |
| 11| [assertion_ml_en](https://nlp.johnsnowlabs.com/2020/01/30/assertion_ml_en.html)  |
| 12| [jsl_assertion_wip](https://nlp.johnsnowlabs.com/2021/01/18/jsl_assertion_wip_en.html)  |
| 13| [jsl_assertion_wip_large](https://nlp.johnsnowlabs.com/2021/01/18/jsl_assertion_wip_large_en.html)  |

### Oncology Assertion Models
|index|model|
|-----:|:-----|
| 1| [assertion_oncology_demographic_binary_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_demographic_binary_wip_en.html)  |
| 2| [assertion_oncology_family_history_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_family_history_wip_en.html)  |
| 3| [assertion_oncology_problem_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_problem_wip_en.html)  |
| 4| [assertion_oncology_response_to_treatment_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_response_to_treatment_wip_en.html)  |
| 5| [assertion_oncology_smoking_status_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_smoking_status_wip_en.html)  |
| 6| [assertion_oncology_test_binary_wip](https://nlp.johnsnowlabs.com/2022/10/01/assertion_oncology_test_binary_wip_en.html)  |
| 7| [assertion_oncology_treatment_binary_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_treatment_binary_wip_en.html)  |
| 8| [assertion_oncology_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_wip_en.html)  |

## Pretrained `assertion_jsl_augmented` model

In [0]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMed dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\
    #.setIncludeAllConfidenceScores(False)

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])

# Assertion model trained on i2b2 (sampled from MIMIC) dataset
clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
ner_jsl download started this may take some time.
[ | ][OK!]
assertion_jsl_augmented download started this may take some time.
[ | ][OK!]


In [0]:
medical.AssertionDLApproach().extractParamMap()

Out[9]: {Param(parent='AssertionDLApproach_37fa9d9f05ed', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='AssertionDLApproach_37fa9d9f05ed', name='label', doc='Column with one label per document'): 'label',
 Param(parent='AssertionDLApproach_37fa9d9f05ed', name='batchSize', doc='Size for each batch in the optimization process'): 64,
 Param(parent='AssertionDLApproach_37fa9d9f05ed', name='epochs', doc='Number of epochs for the optimization process'): 5,
 Param(parent='AssertionDLApproach_37fa9d9f05ed', name='learningRate', doc='Learning rate for the optimization process'): 0.0012,
 Param(parent='AssertionDLApproach_37fa9d9f05ed', name='dropout', doc='Dropout at the output of each layer'): 0.05,
 Param(parent='AssertionDLApproach_37fa9d9f05ed', name='maxSentLen', doc='Max length for an input sentence.'): 250,
 Param(parent='AssertionDLApproach_37fa9d9f05ed', name='includeConfidence', doc='whether to include confidence sco

In [0]:
# we also have a LogReg based Assertion Model.
'''
clinical_assertion_ml = AssertionLogRegModel.pretrained("assertion_ml", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
'''

Out[10]: '\nclinical_assertion_ml = AssertionLogRegModel.pretrained("assertion_ml", "en", "clinical/models")     .setInputCols(["sentence", "ner_chunk", "embeddings"])     .setOutputCol("assertion")\n'

In [0]:
text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae show mild arcus senilis in the right. Left is clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and nontender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
IMPRESSION: At this time is refractory anemia, which is transfusion dependent. He is on B12, iron, folic acid, and Procrit. There are no sign or symptom of blood loss and a recent esophagogastroduodenoscopy, which was negative. His creatinine was 1. 
  My impression at this time is that he probably has an underlying myelodysplastic syndrome or bone marrow failure. His creatinine on this hospitalization was up slightly to 1.6 and this may contribute to his anemia.
  At this time, my recommendation for the patient is that he undergoes further serologic evaluation with reticulocyte count, serum protein, and electrophoresis, LDH, B12, folate, erythropoietin level, and he should undergo a bone marrow aspiration and biopsy. 
  I have discussed the procedure in detail which the patient. I have discussed the risks, benefits, and successes of that treatment and usefulness of the bone marrow and predicting his cause of refractory anemia and further therapeutic interventions, which might be beneficial to him. 
  He is willing to proceed with the studies I have described to him. We will order an ultrasound of his abdomen because of the possible fullness of the spleen, and I will probably see him in follow up after this hospitalization.
  As always, we greatly appreciate being able to participate in the care of your patient. We appreciate the consultation of the patient. 
"""

In [0]:
light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

chunks=[]
entities=[]
status=[]
confidence=[]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):
    
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)
    confidence.append(m.metadata['confidence'])
        
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

|    | chunks | entities | assertion | confidence |
|---:|:-------|:---------|:----------|-----------:|
| 0 | distress | Symptom | Absent | 1.0 |
| 1 | arcus senilis | Disease_Syndrome_Disorder | Past | 1.0 |
| 2 | jugular venous pressure distention | Symptom | Absent | 1.0 |
| 3 | adenopathy | Symptom | Absent | 1.0 |
| 4 | nontender | Symptom | Absent | 1.0 |
| 5 | fullness | Symptom | Possible | 0.9999 |
| 6 | edema | Symptom | Present | 1.0 |
| 7 | cyanosis | VS_Finding | Absent | 1.0 |
| 8 | clubbing | Symptom | Absent | 1.0 |
| 9 | anemia | Disease_Syndrome_Disorder | Hypothetical | 0.9758 |
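
The loop above pairs chunks and assertions positionally with `zip`, which silently drops items if the two lists ever differ in length. A more defensive variant is sketched below. It assumes that each assertion annotation carries the same `begin`/`end` offsets as its chunk; the `Ann` class is a hypothetical stand-in for an annotation object, and the sample offsets are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Ann:
    """Hypothetical stand-in for an annotation (begin, end, result, metadata)."""
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def pair_by_offsets(chunks: List[Ann], assertions: List[Ann]) -> List[dict]:
    """Match each chunk to the assertion that covers the same character span."""
    by_span = {(a.begin, a.end): a for a in assertions}
    rows = []
    for c in chunks:
        a: Optional[Ann] = by_span.get((c.begin, c.end))
        rows.append({
            "chunks": c.result,
            "entities": c.metadata.get("entity"),
            # Unmatched chunks are kept, with empty assertion fields
            "assertion": a.result if a else None,
            "confidence": a.metadata.get("confidence") if a else None,
        })
    return rows


# Made-up annotations for the text "no fever but cough"
chunks = [
    Ann(3, 7, "fever", {"entity": "Symptom"}),
    Ann(13, 17, "cough", {"entity": "Symptom"}),
]
# Only one assertion on purpose, to show the unmatched case
assertions = [Ann(3, 7, "Absent", {"confidence": "0.99"})]

rows = pair_by_offsets(chunks, assertions)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)` to get the same column layout as the table above.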


In [0]:
visualizer = nlp.viz.AssertionVisualizer()

vis = visualizer.display(light_result, 'ner_chunk', 'assertion', return_html=True)
#visualizer.set_label_colors({'TREATMENT':'#008080', 'PROBLEM':'#800080'})


displayHTML(vis)

In [0]:
#downloading the sample dataset
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mt_samples_10.csv

In [0]:
dbutils.fs.cp("file:/databricks/driver/mt_samples_10.csv", "dbfs:/") 

Out[15]: True

In [0]:
import pandas as pd
mt_samples_df = spark.createDataFrame(pd.read_csv("mt_samples_10.csv", sep=',', index_col=["index"]).reset_index())

In [0]:
mt_samples_df.printSchema()

root
 |-- index: long (nullable = true)
 |-- text: string (nullable = true)



In [0]:
mt_samples_df.show()

+-----+--------------------+
|index|                text|
+-----+--------------------+
|    0|Sample Type / Med...|
|    1|Sample Type / Med...|
|    2|Sample Type / Med...|
|    3|Sample Type / Med...|
|    4|Sample Type / Med...|
|    5|Sample Type / Med...|
|    6|Sample Type / Med...|
|    7|Sample Type / Med...|
|    8|Sample Type / Med...|
|    9|Sample Type / Med...|
+-----+--------------------+



In [0]:
result = model.transform(mt_samples_df)

In [0]:
result.show()

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|index|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|           assertion|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    0|Sample Type / Med...|[{document, 0, 54...|[{document, 0, 24...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 68, 76, ...|[{assertion, 68, ...|
|    1|Sample Type / Med...|[{document, 0, 32...|[{document, 0, 26...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 68, 92, ...|[{assertion, 68, ...|
|    2|Sample Type / Med...|[{document, 0, 42...|[{document, 0, 14...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_

In [0]:
result.select('sentence.result').take(1)

Out[21]: [Row(result=['Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1\nDescription:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.', '(Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS:\nMesothelioma.', 'SECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.', 'PROCEDURES', '1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.', '2. On August 20, 2007, thoracentesis.', '3. On August 31, 2007, Port-A-Cath placement.', 'HISTORY AND PHYSICAL:\nThe patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week.', 'She has had right-sided chest pain radiating to her back with fever starting yesterday.', 'She has a history of pericarditis and pericardectomy in May 2006 and developed cough 

In [0]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,  
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result,
                                     result.assertion.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['sentence']").alias("sent_id"),
              F.expr("cols['4']").alias("assertion"),
              F.expr("cols['5']['confidence']").alias("confidence") ).show(truncate=False)

+-------------------------+-----+---+-------------------------+-------+---------+----------+
|chunk                    |begin|end|ner_label                |sent_id|assertion|confidence|
+-------------------------+-----+---+-------------------------+-------+---------+----------+
|Discharge                |68   |76 |Admission_Discharge      |0      |Past     |1.0       |
|pleural effusion         |132  |147|Disease_Syndrome_Disorder|0      |Present  |0.9904    |
|anemia                   |171  |176|Disease_Syndrome_Disorder|0      |Present  |0.8993    |
|ascites                  |179  |185|Disease_Syndrome_Disorder|0      |Present  |0.9992    |
|esophageal reflux        |188  |204|Disease_Syndrome_Disorder|0      |Present  |1.0       |
|deep venous thrombosis   |222  |243|Disease_Syndrome_Disorder|0      |Past     |1.0       |
|Pleural effusion         |340  |355|Disease_Syndrome_Disorder|2      |Present  |1.0       |
|anemia                   |379  |384|Disease_Syndrome_Disorder|2      

## Pretrained `assertion_dl_radiology` model

In [0]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMed dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model for radiology
radiology_ner = medical.NerModel.pretrained("ner_radiology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["ImagingFindings"])

# Assertion model trained on radiology dataset
# coming from sparknlp_jsl.annotator !!

radiology_assertion = medical.AssertionDLModel.pretrained("assertion_dl_radiology", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    radiology_ner,
    ner_converter,
    radiology_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
radiologyAssertion_model = nlpPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
ner_radiology download started this may take some time.
[ | ][ / ][OK!]
assertion_dl_radiology download started this may take some time.
[ | ][OK!]


In [0]:
# A sample text from a radiology report

text = """No right-sided pleural effusion or pneumothorax is definitively seen and there are mildly displaced fractures of the left lateral 8th and likely 9th ribs."""

In [0]:
data = spark.createDataFrame([[text]]).toDF("text")

In [0]:
result = radiologyAssertion_model.transform(data)

In [0]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['sentence']").alias("sent_id"),
              F.expr("cols['2']").alias("assertion")).show(truncate=False)

+-------------------+---------------+-------+---------+
|chunk              |ner_label      |sent_id|assertion|
+-------------------+---------------+-------+---------+
|effusion           |ImagingFindings|0      |Negative |
|pneumothorax       |ImagingFindings|0      |Negative |
|displaced fractures|ImagingFindings|0      |Confirmed|
+-------------------+---------------+-------+---------+
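
The `F.arrays_zip` + `F.explode` pattern used in the cell above turns the parallel annotation arrays of one DataFrame row into one output row per chunk. The plain-Python sketch below mirrors that logic on the values shown in the output table; it is an analogy for illustration, not Spark code. `itertools.zip_longest` is used because `arrays_zip` fills missing elements with nulls when the arrays differ in length, rather than truncating.

```python
from itertools import zip_longest

# Parallel arrays as stored in a single DataFrame row (values copied from the output above)
chunk_results = ["effusion", "pneumothorax", "displaced fractures"]
chunk_metadata = [
    {"entity": "ImagingFindings", "sentence": "0"},
    {"entity": "ImagingFindings", "sentence": "0"},
    {"entity": "ImagingFindings", "sentence": "0"},
]
assertion_results = ["Negative", "Negative", "Confirmed"]

# arrays_zip: combine the i-th element of every array into one tuple (struct)
zipped = list(zip_longest(chunk_results, chunk_metadata, assertion_results))

# explode + select: one output row per tuple, with named fields
rows = [
    {
        "chunk": chunk,
        "ner_label": meta["entity"],
        "sent_id": meta["sentence"],
        "assertion": assertion,
    }
    for chunk, meta, assertion in zipped
]

for row in rows:
    print(row)
```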



## Writing a generic Assertion + NER function

In [0]:
def get_base_pipeline(embeddings='embeddings_clinical'):
    """Build the shared preprocessing stages: document, sentence, token, embeddings."""

    documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    # Sentence Detector annotator, processes various sentences per line
    sentenceDetector = nlp.SentenceDetector()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    # Tokenizer splits words in a relevant format for NLP
    tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    # Clinical word embeddings trained on PubMed dataset
    word_embeddings = nlp.WordEmbeddingsModel.pretrained(embeddings, "en", "clinical/models")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

    base_pipeline = nlp.Pipeline(stages=[
                        documentAssembler,
                        sentenceDetector,
                        tokenizer,
                        word_embeddings])

    return base_pipeline



def get_clinical_assertion(embeddings, spark_df, nrows=100, ner_model_name='ner_clinical', assertion_model_name="assertion_dl"):
    """Run NER and assertion on the first `nrows` rows of `spark_df` and return one row per chunk."""

    # NER model selected by name (the default was trained on the i2b2 dataset, sampled from MIMIC)
    loaded_ner_model = medical.NerModel.pretrained(ner_model_name, "en", "clinical/models") \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("ner")

    ner_converter = nlp.NerConverter() \
        .setInputCols(["sentence", "token", "ner"]) \
        .setOutputCol("ner_chunk")

    # Assertion model selected by name (the default was trained on the i2b2 dataset, sampled from MIMIC)
    # coming from sparknlp_jsl.annotator !!
    clinical_assertion = medical.AssertionDLModel.pretrained(assertion_model_name, "en", "clinical/models") \
        .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
        .setOutputCol("assertion")
      

    base_model = get_base_pipeline (embeddings)

    nlpPipeline = nlp.Pipeline(stages=[
        base_model,
        loaded_ner_model,
        ner_converter,
        clinical_assertion])

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)

    result = model.transform(spark_df.limit(nrows))

    result = result.withColumn("id", F.monotonically_increasing_id())

    result_df = result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                                     result.ner_chunk.metadata, 
                                                     result.assertion.result,
                                                     result.assertion.metadata)).alias("cols")) \
                      .select(F.expr("cols['0']").alias("chunk"),
                              F.expr("cols['1']['entity']").alias("ner_label"),
                              F.expr("cols['2']").alias("assertion"),
                              F.expr("cols['3']['confidence']").alias("confidence"))\
                      .filter("ner_label!='O'")

    return result_df

In [0]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_clinical_large'

nrows = 100

ner_df = get_clinical_assertion (embeddings, mt_samples_df, nrows, ner_model_name)

ner_df.show(30,truncate=50)

ner_clinical_large download started this may take some time.
[ | ][ / ][OK!]
assertion_dl download started this may take some time.
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
+----------------------------+---------+---------+----------+
|                       chunk|ner_label|assertion|confidence|
+----------------------------+---------+---------+----------+
|                Mesothelioma|  PROBLEM|  present|    0.9996|
|                Mesothelioma|  PROBLEM|  present|    0.9996|
|            pleural effusion|  PROBLEM|  present|    0.9997|
|         atrial fibrillation|  PROBLEM|  present|    0.9998|
|                      anemia|  PROBLEM|  present|    0.9997|
|                     ascites|  PROBLEM|  present|    0.9997|
|           esophageal reflux|  PROBLEM|  present|    0.9998|
|      deep venous thrombosis|  PROBLEM|  present|    0.9998|
|                Mesothelioma|  PROBLEM|  present|    0.999

In [0]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_posology'

nrows = 100

ner_df = get_clinical_assertion (embeddings, mt_samples_df, nrows, ner_model_name)

ner_df.show()

ner_posology download started this may take some time.
[ | ][ / ][OK!]
assertion_dl download started this may take some time.
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
+----------------+---------+------------+----------+
|           chunk|ner_label|   assertion|confidence|
+----------------+---------+------------+----------+
|        Coumadin|     DRUG|hypothetical|    0.8709|
|            1 mg| STRENGTH| conditional|    0.7772|
|           daily|FREQUENCY| conditional|    0.5086|
|      Amiodarone|     DRUG|hypothetical|    0.8589|
|          100 mg| STRENGTH|hypothetical|    0.6143|
|             p.o|    ROUTE|hypothetical|    0.7991|
|           daily|FREQUENCY|     present|    0.9074|
|        Coumadin|     DRUG|     present|    0.9997|
|         Lovenox|     DRUG|     present|    0.9994|
|           40 mg| STRENGTH|     present|    0.9982|
|  subcutaneously|    ROUTE|     present|    0.9871|
|    

In [0]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_posology_greedy'

entry_data = spark.createDataFrame([["The patient did not take a capsule of Advil."]]).toDF("text")

ner_df = get_clinical_assertion (embeddings, entry_data, nrows, ner_model_name)

ner_df.show()

ner_posology_greedy download started this may take some time.
[ | ][ / ][OK!]
assertion_dl download started this may take some time.
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
+----------------+---------+---------+----------+
|           chunk|ner_label|assertion|confidence|
+----------------+---------+---------+----------+
|capsule of Advil|     DRUG|   absent|    0.9855|
+----------------+---------+---------+----------+



In [0]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_clinical'

entry_data = spark.createDataFrame([["The patient has no fever"]]).toDF("text")

ner_df = get_clinical_assertion (embeddings, entry_data, nrows, ner_model_name)

ner_df.show()

ner_clinical download started this may take some time.
[ | ][ / ][OK!]
assertion_dl download started this may take some time.
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
+-----+---------+---------+----------+
|chunk|ner_label|assertion|confidence|
+-----+---------+---------+----------+
|fever|  PROBLEM|   absent|     0.998|
+-----+---------+---------+----------+



In [0]:
import pandas as pd

def get_clinical_assertion_light(light_model, text):
    """Annotate `text` with a LightPipeline and return chunks, entity labels and assertion statuses."""

    light_result = light_model.fullAnnotate(text)[0]

    chunks = []
    entities = []
    status = []

    # Each assertion annotation is paired positionally with its NER chunk
    for n, m in zip(light_result['ner_chunk'], light_result['assertion']):
        chunks.append(n.result)
        entities.append(n.metadata['entity'])
        status.append(m.result)

    df = pd.DataFrame({'chunks': chunks, 'entities': entities, 'assertion': status})

    return df

In [0]:
clinical_text = """
Patient with severe fever and sore throat. 
He shows no stomach pain and he maintained on an epidural and PCA for pain control.
He also became short of breath with climbing a flight of stairs.
After CT, lung tumor located at the right lower lobe. Father with Alzheimer.
"""

light_model = nlp.LightPipeline(model)

get_clinical_assertion_light (light_model, clinical_text)

# cols = [
#      'entities_ner_chunk',
#      'entities_ner_chunk_class', 
#      'assertion',
#      'assertion_confidence']
#      
# df = nlp.nlu.to_pretty_df(light_model,clinical_text, output_level='chunk')[cols]
# df


|    | chunks | entities | assertion |
|---:|:-------|:---------|:----------|
| 0 | fever | VS_Finding | Present |
| 1 | sore throat | Symptom | Present |
| 2 | stomach pain | Symptom | Absent |
| 3 | pain | Symptom | Hypothetical |
| 4 | short of breath | Symptom | Present |
| 5 | climbing a flight of stairs | Symptom | Present |
| 6 | Alzheimer | Disease_Syndrome_Disorder | Family |


# Oncological Assertion Models


|    | model_name              |Predicted Entities|
|---:|:------------------------|-|
| 1 | [assertion_oncology_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_wip_en.html) | Medical_History, Family_History, Possible, Hypothetical_Or_Absent|
| 2 | [assertion_oncology_problem_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_problem_wip_en.html) |Present, Possible, Hypothetical, Absent, Family|
| 3 | [assertion_oncology_treatment_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_treatment_binary_wip_en.html) |Present, Planned, Past, Hypothetical, Absent|
| 4 | [assertion_oncology_response_to_treatment_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_response_to_treatment_wip_en.html) |Present_Or_Past, Hypothetical_Or_Absent|
| 5 | [assertion_oncology_test_binary_wip](https://nlp.johnsnowlabs.com/2022/10/01/assertion_oncology_test_binary_wip_en.html) |Present_Or_Past, Hypothetical_Or_Absent|
| 6 | [assertion_oncology_smoking_status_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_smoking_status_wip_en.html) |Absent, Past, Present|
| 7 | [assertion_oncology_family_history_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_family_history_wip_en.html) |Family_History, Other|
| 8 | [assertion_oncology_demographic_binary_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_demographic_binary_wip_en.html) |Patient, Someone_Else|

In [0]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_oncology_wip'

assertion_model_name='assertion_oncology_wip'

nrows = 100

ner_df = get_clinical_assertion (embeddings, mt_samples_df, nrows, ner_model_name,assertion_model_name )

ner_df.show(truncate = False)

ner_oncology_wip download started this may take some time.
[ | ][ / ][OK!]
assertion_oncology_wip download started this may take some time.
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
+------------------------+--------------------+------------+----------+
|chunk                   |ner_label           |assertion   |confidence|
+------------------------+--------------------+------------+----------+
|Mesothelioma            |Cancer_Dx           |Present     |0.9885    |
|Mesothelioma            |Cancer_Dx           |Hypothetical|0.981     |
|August 24, 2007         |Date                |Past        |0.9726    |
|decortication           |Cancer_Surgery      |Past        |0.994     |
|lung                    |Site_Lung           |Past        |0.9453    |
|pleural                 |Site_Other_Body_Part|Past        |0.9624    |
|biopsy                  |Pathology_Test      |Past        |0.9979    |
|transpleural

# Assertion Filterer
`AssertionFilterer` filters named entities by a white list of acceptable assertion statuses. It is handy when you want to keep only chunks with statuses such as `present` or `conditional` and prevent chunks with other statuses, such as `absent`, from leaving your pipeline.

In [0]:
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\
    #.setIncludeAllConfidenceScores(False)

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setWhiteList(["Present"])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
assertionFilter_model = nlpPipeline.fit(empty_data)

ner_clinical download started this may take some time.
[ | ][OK!]
assertion_jsl_augmented download started this may take some time.
[ | ][OK!]


In [0]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain.'

light_model = nlp.LightPipeline(assertionFilter_model)
light_result = light_model.annotate(text)

light_result.keys()

Out[37]: dict_keys(['assertion_filtered', 'document', 'ner_chunk', 'assertion', 'token', 'ner', 'embeddings', 'sentence'])

In [0]:
list(zip(light_result['ner_chunk'], light_result['assertion']))

Out[38]: [('a headache', 'Present'),
 ('a head CT', 'Planned'),
 ('anxious', 'Possible'),
 ('Alopecia', 'Hypothetical'),
 ('pain', 'Absent')]

In [0]:
assertion_filterer.getWhiteList()

Out[39]: ['Present']

In [0]:
chunks=[]
entities=[]
status=[]
confidence=[]

light_result = light_model.fullAnnotate(text)[0]

for m in light_result['assertion_filtered']:
    
    chunks.append(m.result)
    entities.append(m.metadata['entity']) 
    status.append(m.metadata['assertion'])
    confidence.append(m.metadata['confidence'])
        
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

       chunks entities assertion confidence
0  a headache  PROBLEM   Present     0.9721


As you can see, only the "a headache" chunk remains. The "pain" chunk, for example, is gone because its assertion label is "Absent", which is not in the whitelist.

# Train a custom Assertion Model

**WARNING:** For training an Assertion model, please use TensorFlow version 2.11

In [0]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/i2b2_assertion_sample_short.csv -P /dbfs/

In [0]:
assertion_df = spark.read.option("header", True).option("inferSchema", "True").csv("/i2b2_assertion_sample_short.csv")


In [0]:
assertion_df.show(3, truncate=100)

+-------------------------------------------------+-------------------+-------+-----+---+
|                                             text|             target|  label|start|end|
+-------------------------------------------------+-------------------+-------+-----+---+
|She has no history of liver disease , hepatitis .|      liver disease| absent|    5|  6|
|                         1. Undesired fertility .|undesired fertility|present|    1|  2|
|                            3) STATUS POST FALL .|               fall|present|    3|  3|
+-------------------------------------------------+-------------------+-------+-----+---+
only showing top 3 rows



In [0]:
(training_data, test_data) = assertion_df.randomSplit([0.8, 0.2], seed = 100)
print("Training Dataset Count: " + str(training_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 721
Test Dataset Count: 170


In [0]:
training_data.groupBy('label').count().orderBy('count', ascending=False).show(truncate=False)


+-------+-----+
|label  |count|
+-------+-----+
|present|546  |
|absent |175  |
+-------+-----+



In [0]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("chunk")\
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(True)

token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["document", "token"])\
      .setOutputCol("embeddings")


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]


We transform the test data with a pipeline made of the same stages that precede `AssertionDLApproach` in the training pipeline.
This ensures that the test data has the same columns as the training data seen by `AssertionDLApproach`. <br/>
The purpose of this step is to make the `setTestDataset()` parameter of `AssertionDLApproach` usable.

In [0]:
clinical_assertion_pipeline = nlp.Pipeline(
    stages = [
    document,
    chunk,
    token,
    embeddings])

assertion_test_data = clinical_assertion_pipeline.fit(test_data).transform(test_data)

In [0]:
assertion_test_data.columns

Out[48]: ['text',
 'target',
 'label',
 'start',
 'end',
 'document',
 'chunk',
 'token',
 'embeddings']
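A quick sanity check before saving can confirm that the transformed test data exposes every column the training stage will read. The sketch below is a hedged illustration: `missing_columns` is a hypothetical helper, and the required names are an assumption based on the label, start, end, and input columns that `AssertionDLApproach` is configured with in this notebook.

```python
# Hypothetical helper: reports which columns needed by the training stage are absent.
# The required names are assumed from this notebook's AssertionDLApproach settings
# (label/start/end columns plus the document, chunk and embeddings inputs).

REQUIRED_COLUMNS = ["label", "start", "end", "document", "chunk", "embeddings"]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are not in `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Column lists as they appear in this notebook.
transformed_columns = ["text", "target", "label", "start", "end",
                       "document", "chunk", "token", "embeddings"]
raw_columns = ["text", "target", "label", "start", "end"]

print(missing_columns(transformed_columns))  # nothing missing after the transform
print(missing_columns(raw_columns))          # raw data lacks the annotation columns
```

In the notebook itself the same check could be run as `missing_columns(assertion_test_data.columns)`.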

We save the test data in parquet format to use in `AssertionDLApproach()`.

In [0]:
assertion_test_data.write.mode("overwrite").parquet('i2b2_assertion_sample_test_data.parquet')

#### Graph Setup

In [0]:
!pip install -q tensorflow==2.11.0
!pip install -q tensorflow-addons



In [0]:
%fs mkdirs file:/dbfs/tf_graphs

In [0]:
assertion_graph_builder = medical.TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder('file:/dbfs/tf_graphs')\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(250)\
    .setHiddenUnitsNumber(25)

In [0]:
%fs mkdirs file:/dbfs/assertion_logs

In [0]:
 # %fs mkdirs file:/dbfs/assertion_tf_graphs
 # %fs mkdirs file:/dbfs/assertion_logs

# if you want you can use existing graph

# !wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/tf_graphs/blstm_34_32_30_200_2.pb -P /dbfs/assertion_tf_graphs

In [0]:
# Create custom graph


# tf_graph.print_model_params("assertion_dl")

# feat_size = 200
# n_classes = 2

# tf_graph.build("assertion_dl",
#                build_params={"n_classes": n_classes},
#                model_location= "/dbfs/assertion_tf_graphs", 
#                model_filename="blstm_34_32_30_{}_{}.pb".format(feat_size, n_classes))

**Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models**


The `setScopeWindow()` parameter lets you train Assertion Status models that focus on a specific context window when resolving the status of a NER chunk. The window has the format `[X,Y]`, where `X` is the maximum number of tokens to consider to the left of the chunk and `Y` is the maximum number of tokens to consider to the right. Let’s take a look at what different windows mean:


*   By default, the window is `[-1,-1]`, which means that the model looks at all of the tokens in the sentence/document (up to the maximum number of tokens set with `setMaxSentLen()`).
*   `[0,0]` means “don’t pay attention to any token except the ner_chunk”, which effectively ignores all context when resolving the assertion status.
*   `[9,15]` is what empirically seems to be the best baseline: the model looks at up to 9 tokens to the left and 15 to the right of the NER chunk to understand the context and resolve the status.


See the [Scope Window Tuning Assertion Status Detection notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.1.Scope_window_tuning_assertion_status_detection.ipynb), which illustrates the effect of different windows and shows how to fine-tune your AssertionDLModels to get the most out of them.

In our case, the best scope window is around `[10,10]`.
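To make the window semantics concrete, here is a minimal sketch in plain Python. It is an illustration of the `[X,Y]` description above, not the library's internal logic: `tokens_in_scope` is a hypothetical helper, whitespace tokenization is an assumption, treating a negative value as "no limit" on each side separately is an assumption, and the `setMaxSentLen()` cap is not modeled. The sentence and the token indices come from the first row of the i2b2 sample shown earlier.

```python
# Illustrative sketch of the scope-window semantics described above.
# `tokens_in_scope` is a hypothetical helper, not a Spark NLP function.
# Assumptions: whitespace tokenization; a negative bound means "no limit on that side".

def tokens_in_scope(tokens, start, end, window):
    """Return the tokens a window [left, right] exposes around a chunk.

    `start` and `end` are the inclusive token indices of the chunk.
    """
    left, right = window
    first = 0 if left < 0 else max(0, start - left)
    last = len(tokens) if right < 0 else min(len(tokens), end + 1 + right)
    return tokens[first:last]


tokens = "She has no history of liver disease , hepatitis .".split()
start, end = 5, 6  # token indices of the chunk "liver disease"

for window in ([-1, -1], [0, 0], [3, 1]):
    print(window, tokens_in_scope(tokens, start, end, window))
```

With `[0,0]` only "liver disease" is visible, so the negation cue "no" is lost; a window of `[3,1]` is already wide enough to include it.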

In [0]:
scope_window = [10,10]

assertionStatus = medical.AssertionDLApproach()\
    .setLabelCol("label")\
    .setInputCols("document", "chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setDropout(0.1)\
    .setLearningRate(0.001)\
    .setEpochs(50)\
    .setValidationSplit(0.2)\
    .setStartCol("start")\
    .setEndCol("end")\
    .setMaxSentLen(250)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('dbfs:/assertion_logs')\
    .setGraphFolder('dbfs:/tf_graphs')\
    .setGraphFile("file:/dbfs/tf_graphs/assertion_graph.pb")\
    .setTestDataset(path="/i2b2_assertion_sample_test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)


# Note: when .setTestDataset() is used, raw test data cannot be passed to it.
# .setTestDataset() only works with DataFrames that have already been transformed
# by a pipeline consisting of the document, chunk, and embeddings stages.

In [0]:
clinical_assertion_pipeline = nlp.Pipeline(
    stages = [
    document,
    chunk,
    token,
    embeddings,
    assertion_graph_builder,
    assertionStatus])

In [0]:
assertion_model = clinical_assertion_pipeline.fit(training_data)

TF Graph Builder configuration:
Model name: assertion_dl
Graph folder: file:/dbfs/tf_graphs
Graph file name: assertion_graph.pb
Build params: {'n_classes': 2, 'feat_size': 200, 'max_seq_len': 250, 'n_hidden': 25}


Instructions for updating:
non-resource variables are not supported in the long term


Device mapping: no known devices.


Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Device mapping: no known devices.
assertion_dl graph exported to file:/dbfs/tf_graphs/assertion_graph.pb


In [0]:
assertion_model.stages

Out[57]: [DocumentAssembler_97deca732a00,
 Doc2Chunk_be737b938be7,
 REGEX_TOKENIZER_98ec707dd083,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 TFGraphBuilderModel_546223903206,
 ASSERTION_DL_4042af36b1a9]

## Checking the results

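Training metrics are written to the folder set with `setOutputLogsPath()`. Below, we run the trained pipeline on the test data and compare its predictions with the gold labels.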

In [0]:
preds = assertion_model.transform(test_data).select('label','assertion.result')

preds.show()

+-------+---------+
|  label|   result|
+-------+---------+
|present|[present]|
| absent|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
+-------+---------+
only showing top 20 rows



In [0]:
preds_df = preds.toPandas()

In [0]:
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])


In [0]:
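# We use scikit-learn to evaluate the results on the test dataset.
# Note: classification_report expects (y_true, y_pred). Predictions are passed first here,
# so the precision and recall columns below are swapped relative to the usual convention.
from sklearn.metrics import classification_report

print(classification_report(preds_df['result'], preds_df['label']))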

              precision    recall  f1-score   support

      absent       0.74      0.91      0.81        43
     present       0.97      0.89      0.93       127

    accuracy                           0.89       170
   macro avg       0.85      0.90      0.87       170
weighted avg       0.91      0.89      0.90       170
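To make the columns of the report concrete, the sketch below recomputes per-class precision, recall, and F1 in plain Python. It is a hedged illustration rather than scikit-learn's implementation: `per_class_scores` is a hypothetical helper, the labels are made-up toy data (not taken from the i2b2 sample), and the first argument is assumed to hold the gold labels and the second the predictions.

```python
# Illustrative sketch: per-class precision, recall and F1 computed by hand.
# `per_class_scores` is a hypothetical helper; the labels below are made-up toy data.

def per_class_scores(y_true, y_pred):
    """Return {label: {precision, recall, f1, support}} for gold vs. predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)  # how often the model said `label`
        actual = sum(1 for t in y_true if t == label)     # how often `label` is the gold label
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual}
    return scores


y_true = ["present", "present", "present", "present", "absent", "absent"]
y_pred = ["present", "present", "present", "absent", "absent", "absent"]

scores = per_class_scores(y_true, y_pred)
for label, values in scores.items():
    print(label, values)

# Swapping the arguments swaps precision and recall for every class.
swapped = per_class_scores(y_pred, y_true)
```

Because precision is measured against the predictions and recall against the gold labels, the order in which the two lists are passed determines which column is which.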



In [0]:
#saving the model that we've trained
assertion_model.stages[-1].write().overwrite().save('/databricks/driver/models/custom_assertion_model')

### Load saved model

In [0]:
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

clinical_assertion = medical.AssertionDLModel.load("/databricks/driver/models/custom_assertion_model") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)


ner_clinical download started this may take some time.
[ | ][OK!]


In [0]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted'


light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

print(text)

chunks=[]
entities=[]
status=[]
confidence=[]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):
    
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)
    confidence.append(m.metadata['confidence'])
        
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted


       chunks entities assertion confidence
0  a headache  PROBLEM   present        1.0
1   a head CT     TEST   present        1.0
2     anxious  PROBLEM   present     0.9998
3    alopecia  PROBLEM    absent     0.9751
4        pain  PROBLEM    absent     0.9855
