![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb)

# Clinical Assertion Status Model 


The deep neural network architecture for assertion status detection in Spark NLP is based on a Bi-LSTM framework and is a modified version of the architecture proposed by Federico Fancellu, Adam Lopez and Bonnie Webber ([Neural Networks For Negation Scope Detection](https://aclanthology.org/P16-1047.pdf)). Its goal is to classify the assertions made on given medical concepts as being present, absent, or possible in the patient; conditionally present in the patient under certain circumstances; hypothetically present in the patient at some future point; or mentioned in the patient report but associated with someone else.

In the proposed implementation, input units depend on the target tokens (a named entity) and the neighboring words, which are explicitly encoded as a sequence using word embeddings. As in the paper mentioned above, 95% of the scope tokens (neighboring words) were observed to fall within a window of 9 tokens to the left and 15 to the right of the target tokens in the same dataset. The same window size was therefore used, together with the following parameters: learning rate 0.0012, dropout 0.05, batch size 64, and a maximum sentence length of 250. The model is implemented in Spark NLP as an annotator called AssertionDLModel. After training for 20 epochs and measuring accuracy on the official test set, this implementation exceeds the latest state-of-the-art accuracy benchmarks on four of the six labels and on micro F1, as summarized in the following table:

|Assertion Label|Spark NLP|Latest Best|
|-|-|-|
|Absent       |0.944 |0.937|
|Someone-else |0.904|0.869|
|Conditional  |0.441|0.422|
|Hypothetical |0.862|0.890|
|Possible     |0.680|0.630|
|Present      |0.953|0.957|
|micro F1     |0.939|0.934|
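
To make the scope window described above concrete, here is a minimal sketch in plain Python. It is only an illustration of the "9 tokens to the left, 15 to the right" idea and is not taken from the AssertionDLModel internals; the function name `scope_window` and its parameters are hypothetical.

```python
from typing import List, Tuple


def scope_window(tokens: List[str], start: int, end: int,
                 left: int = 9, right: int = 15
                 ) -> Tuple[List[str], List[str], List[str]]:
    """Split a tokenized sentence into (left context, target, right context).

    `start` and `end` are inclusive token indices of the target entity.
    The defaults mirror the window sizes quoted in the text above.
    """
    if not (0 <= start <= end < len(tokens)):
        raise ValueError("target span must lie inside the sentence")
    left_ctx = tokens[max(0, start - left):start]   # at most `left` tokens
    target = tokens[start:end + 1]
    right_ctx = tokens[end + 1:end + 1 + right]     # at most `right` tokens
    return left_ctx, target, right_ctx


tokens = "The patient has no fever".split()
left_ctx, target, right_ctx = scope_window(tokens, 4, 4)
print(left_ctx, target, right_ctx)
# ['The', 'patient', 'has', 'no'] ['fever'] []
```

For short sentences the window simply covers everything available, so the negation cue "no" ends up in the left context of "fever".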


**Colab Setup**

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
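
The install and session cells that follow read `SECRET`, `PUBLIC_VERSION` and `JSL_VERSION` from the license file. A small check like the one below can surface a missing key early. This is only a sketch: the helper name `missing_license_keys` is hypothetical, the key list is limited to the names this notebook references (a real license file may contain more), and the values shown are placeholders.

```python
import json

# Keys referenced later in this notebook; assumed, not an exhaustive list.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(license_keys: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not license_keys.get(key)]


# Placeholder content standing in for spark_jsl.json
example = json.loads('{"SECRET": "xxx", "PUBLIC_VERSION": "4.3.2"}')
print(missing_license_keys(example))
# ['JSL_VERSION']
```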

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing NLU
! pip install --upgrade -q nlu==4.0.1rc4 --no-dependencies

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
import json
import os

from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
# F is used by the arrays_zip / explode cells further down
from pyspark.sql import functions as F

import nlu
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.3.2
Spark NLP_JSL Version : 4.3.2


In [None]:
# Alternative: build the SparkSession manually if you need custom parameters
# beyond those passed to sparknlp_jsl.start() above
from pyspark.sql import SparkSession

def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)

# Clinical Assertion Models (with pretrained models)

|    | model_name              |Predicted Entities|
|---:|:------------------------|-|
|  1 | [assertion_dl](https://nlp.johnsnowlabs.com/2021/01/26/assertion_dl_en.html)            |Present, Absent, Possible, conditional, hypothetical, associated_with_someone_else|
|  2 | [assertion_dl_biobert](https://nlp.johnsnowlabs.com/2021/01/26/assertion_dl_biobert_en.html)    |Present, Absent, Possible, conditional, hypothetical, associated_with_someone_else|
|  3 | [assertion_dl_healthcare](https://nlp.johnsnowlabs.com/2020/09/23/assertion_dl_healthcare_en.html) |Present, Absent, Possible, conditional, hypothetical, associated_with_someone_else|
|  4 | [assertion_dl_large](https://nlp.johnsnowlabs.com/2020/05/21/assertion_dl_large_en.html)      |Present, Absent, Possible, conditional, hypothetical, associated_with_someone_else|
|  5 | [assertion_dl_radiology](https://nlp.johnsnowlabs.com/2021/03/18/assertion_dl_radiology_en.html)   |Confirmed, Suspected, Negative|
|  6 | [assertion_jsl](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_en.html)           |Present, Absent, Possible, Planned, Someoneelse, Past, Family, Hypotetical|
|  7 | [assertion_jsl_large](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_large_en.html)     |present, absent, possible, planned, someoneelse, past, hypothetical|
|  8 |  [assertion_ml](https://nlp.johnsnowlabs.com/2020/01/30/assertion_ml_en.html) |Hypothetical, Present, Absent, Possible, Conditional, Associated_with_someone_else|
|  9 | [assertion_dl_scope_L10R10](https://nlp.johnsnowlabs.com/2022/03/17/assertion_dl_scope_L10R10_en_3_0.html)| hypothetical, associated_with_someone_else, conditional, possible, absent, present|
| 10 | [assertion_dl_biobert_scope_L10R10](https://nlp.johnsnowlabs.com/2022/03/24/assertion_dl_biobert_scope_L10R10_en_2_4.html)| hypothetical, associated_with_someone_else, conditional, possible, absent, present|
| 11 | [assertion_jsl_augmented](https://nlp.johnsnowlabs.com/2022/09/15/assertion_jsl_augmented_en.html)| Present, Absent, Possible, Planned, Past, Family, Hypotetical, SomeoneElse|






### Oncology Assertion Models

|    | model_name              |Predicted Entities|
|---:|:------------------------|-|
| 1 | [assertion_oncology_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_wip_en.html) | Medical_History, Family_History, Possible, Hypothetical_Or_Absent|
| 2 | [assertion_oncology_problem_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_problem_wip_en.html) |Present, Possible, Hypothetical, Absent, Family|
| 3 | [assertion_oncology_treatment_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_treatment_binary_wip_en.html) |Present, Planned, Past, Hypothetical, Absent|
| 3 | [assertion_oncology_treatment_wip]() |Present, Planned, Past, Hypothetical, Absent|
| 4 | [assertion_oncology_response_to_treatment_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_response_to_treatment_wip_en.html) |Present_Or_Past, Hypothetical_Or_Absent|
| 5 | [assertion_oncology_test_binary_wip](https://nlp.johnsnowlabs.com/2022/10/01/assertion_oncology_test_binary_wip_en.html) |Present_Or_Past, Hypothetical_Or_Absent|
| 6 | [assertion_oncology_smoking_status_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_smoking_status_wip_en.html) |Absent, Past, Present|
| 7 | [assertion_oncology_family_history_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_family_history_wip_en.html) |Family_History, Other|
| 8 | [assertion_oncology_demographic_binary_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_demographic_binary_wip_en.html) |Patient, Someone_Else|

### Pretrained `assertion_jsl_augmented` model

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    #.setIncludeAllConfidenceScores(False)

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])

# Assertion model trained on i2b2 (sampled from MIMIC) dataset
clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
assertion_jsl_augmented download started this may take some time.
[OK!]


In [None]:
AssertionDLApproach().extractParamMap()

{Param(parent='AssertionDLApproach_82aa36b1741d', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='AssertionDLApproach_82aa36b1741d', name='label', doc='Column with one label per document'): 'label',
 Param(parent='AssertionDLApproach_82aa36b1741d', name='batchSize', doc='Size for each batch in the optimization process'): 64,
 Param(parent='AssertionDLApproach_82aa36b1741d', name='epochs', doc='Number of epochs for the optimization process'): 5,
 Param(parent='AssertionDLApproach_82aa36b1741d', name='learningRate', doc='Learning rate for the optimization process'): 0.0012,
 Param(parent='AssertionDLApproach_82aa36b1741d', name='dropout', doc='Dropout at the output of each layer'): 0.05,
 Param(parent='AssertionDLApproach_82aa36b1741d', name='maxSentLen', doc='Max length for an input sentence.'): 250,
 Param(parent='AssertionDLApproach_82aa36b1741d', name='includeConfidence', doc='whether to include confidence scores in a

In [None]:
# we also have a LogReg based Assertion Model.
'''
clinical_assertion_ml = AssertionLogRegModel.pretrained("assertion_ml", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
'''

In [None]:
text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and nontender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
IMPRESSION: At this time is refractory anemia, which is transfusion dependent. He is on B12, iron, folic acid, and Procrit. There are no sign or symptom of blood loss and a recent esophagogastroduodenoscopy, which was negative. His creatinine was 1. 
  My impression at this time is that he probably has an underlying myelodysplastic syndrome or bone marrow failure. His creatinine on this hospitalization was up slightly to 1.6 and this may contribute to his anemia.
  At this time, my recommendation for the patient is that he undergoes further serologic evaluation with reticulocyte count, serum protein, and electrophoresis, LDH, B12, folate, erythropoietin level, and he should undergo a bone marrow aspiration and biopsy. 
  I have discussed the procedure in detail which the patient. I have discussed the risks, benefits, and successes of that treatment and usefulness of the bone marrow and predicting his cause of refractory anemia and further therapeutic interventions, which might be beneficial to him. 
  He is willing to proceed with the studies I have described to him. We will order an ultrasound of his abdomen because of the possible fullness of the spleen, and I will probably see him in follow up after this hospitalization.
  As always, we greatly appreciate being able to participate in the care of your patient. We appreciate the consultation of the patient. 
"""

In [None]:
light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

chunks=[]
entities=[]
status=[]
confidence=[]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):
    
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)
    confidence.append(m.metadata['confidence'])
        
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

|    | chunks | entities | assertion | confidence |
|---:|:-------|:---------|:----------|-----------:|
| 0 | distress | Symptom | Absent | 1.0 |
| 1 | jugular venous pressure distention | Symptom | Absent | 1.0 |
| 2 | adenopathy | Symptom | Absent | 1.0 |
| 3 | nontender | Symptom | Absent | 1.0 |
| 4 | fullness | Symptom | Possible | 0.9999 |
| 5 | edema | Symptom | Present | 1.0 |
| 6 | cyanosis | VS_Finding | Absent | 1.0 |
| 7 | clubbing | Symptom | Absent | 1.0 |
| 8 | anemia | Disease_Syndrome_Disorder | Hypothetical | 0.9758 |
| 9 | blood loss | Symptom | Absent | 1.0 |
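
Once chunks and assertion statuses are paired as above, a common next step is to keep only the findings that apply to the patient. The sketch below works on plain dictionaries shaped like the rows of the table; the helper name `select_assertions` and the 0.9 threshold are illustrative assumptions. Annotation metadata values are stored as strings, so the confidence is converted with `float()` before comparison.

```python
def select_assertions(rows, statuses=("Present",), min_confidence=0.9):
    """Keep rows whose assertion status is wanted and confident enough."""
    wanted = {status.lower() for status in statuses}
    selected = []
    for row in rows:
        confidence = float(row["confidence"])  # metadata values are strings
        if row["assertion"].lower() in wanted and confidence >= min_confidence:
            selected.append(row)
    return selected


# A few rows copied from the output above
rows = [
    {"chunks": "fullness", "assertion": "Possible", "confidence": "0.9999"},
    {"chunks": "edema", "assertion": "Present", "confidence": "1.0"},
    {"chunks": "anemia", "assertion": "Hypothetical", "confidence": "0.9758"},
    {"chunks": "blood loss", "assertion": "Absent", "confidence": "1.0"},
]

print([row["chunks"] for row in select_assertions(rows)])
# ['edema']
```

Passing `statuses=("Present", "Possible")` would also keep "fullness". Which statuses count as clinically relevant depends on the downstream task.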


In [None]:
light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

from sparknlp_display import AssertionVisualizer

vis = AssertionVisualizer()

vis.set_label_colors({'TEST':'#008080', 'PROBLEM':'#800080'})

vis.display(light_result, 'ner_chunk', 'assertion')

In [None]:
nlu.to_pretty_df(model,text,output_level='chunk').columns



Index(['assertion', 'assertion_confidence', 'document', 'entities_ner_chunk',
       'entities_ner_chunk_class', 'entities_ner_chunk_confidence',
       'entities_ner_chunk_origin_chunk', 'entities_ner_chunk_origin_sentence',
       'sentence_pragmatic', 'word_embedding_embeddings'],
      dtype='object')

In [None]:
cols = [
     'entities_ner_chunk',
     'entities_ner_chunk_class', 
     'assertion',
     'assertion_confidence']
     
df = nlu.to_pretty_df(model,text,output_level='chunk')[cols].reset_index(drop=True)
df




|    | entities_ner_chunk | entities_ner_chunk_class | assertion | assertion_confidence |
|---:|:-------------------|:-------------------------|:----------|---------------------:|
| 0 | distress | Symptom | Absent | 1.0 |
| 1 | jugular venous pressure distention | Symptom | Absent | 1.0 |
| 2 | adenopathy | Symptom | Absent | 1.0 |
| 3 | nontender | Symptom | Absent | 1.0 |
| 4 | fullness | Symptom | Possible | 0.9999 |
| 5 | edema | Symptom | Present | 1.0 |
| 6 | cyanosis | VS_Finding | Absent | 1.0 |
| 7 | clubbing | Symptom | Absent | 1.0 |
| 8 | anemia | Disease_Syndrome_Disorder | Hypothetical | 0.9758 |
| 9 | blood loss | Symptom | Absent | 1.0 |


In [None]:
# Downloading sample datasets.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mt_samples_10.csv

In [None]:
mt_samples_df = spark.createDataFrame(pd.read_csv("/content/mt_samples_10.csv", sep=',', index_col=["index"]).reset_index())
                
mt_samples_df.printSchema()

root
 |-- index: long (nullable = true)
 |-- text: string (nullable = true)



In [None]:
mt_samples_df.show()

+-----+--------------------+
|index|                text|
+-----+--------------------+
|    0|Sample Type / Med...|
|    1|Sample Type / Med...|
|    2|Sample Type / Med...|
|    3|Sample Type / Med...|
|    4|Sample Type / Med...|
|    5|Sample Type / Med...|
|    6|Sample Type / Med...|
|    7|Sample Type / Med...|
|    8|Sample Type / Med...|
|    9|Sample Type / Med...|
+-----+--------------------+



In [None]:
result = model.transform(mt_samples_df)

In [None]:
result.show()

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|index|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|           assertion|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    0|Sample Type / Med...|[{document, 0, 54...|[{document, 0, 24...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 68, 76, ...|[{assertion, 68, ...|
|    1|Sample Type / Med...|[{document, 0, 32...|[{document, 0, 26...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 68, 92, ...|[{assertion, 68, ...|
|    2|Sample Type / Med...|[{document, 0, 42...|[{document, 0, 14...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_

In [None]:
result.select('sentence.result').take(1)

[Row(result=['Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1\nDescription:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.', '(Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS:\nMesothelioma.', 'SECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.', 'PROCEDURES', '1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.', '2. On August 20, 2007, thoracentesis.', '3. On August 31, 2007, Port-A-Cath placement.', 'HISTORY AND PHYSICAL:\nThe patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week.', 'She has had right-sided chest pain radiating to her back with fever starting yesterday.', 'She has a history of pericarditis and pericardectomy in May 2006 and developed cough with righ

In [None]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,  
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result,
                                     result.assertion.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['sentence']").alias("sent_id"),
              F.expr("cols['4']").alias("assertion"),
              F.expr("cols['5']['confidence']").alias("confidence") ).show(truncate=False)

+-------------------------+-----+---+-------------------------+-------+---------+----------+
|chunk                    |begin|end|ner_label                |sent_id|assertion|confidence|
+-------------------------+-----+---+-------------------------+-------+---------+----------+
|Discharge                |68   |76 |Admission_Discharge      |0      |Past     |1.0       |
|pleural effusion         |132  |147|Disease_Syndrome_Disorder|0      |Present  |0.9904    |
|anemia                   |171  |176|Disease_Syndrome_Disorder|0      |Present  |0.8993    |
|ascites                  |179  |185|Disease_Syndrome_Disorder|0      |Present  |0.9992    |
|esophageal reflux        |188  |204|Disease_Syndrome_Disorder|0      |Present  |1.0       |
|deep venous thrombosis   |222  |243|Disease_Syndrome_Disorder|0      |Past     |1.0       |
|Pleural effusion         |340  |355|Disease_Syndrome_Disorder|2      |Present  |1.0       |
|anemia                   |379  |384|Disease_Syndrome_Disorder|2      
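
The `arrays_zip` + `explode` cell above pairs the i-th chunk with the i-th assertion and emits one row per pair. As a rough plain-Python analogue (not Spark code), the same pairing can be sketched with `zip` over parallel lists. The dictionaries below only mimic the annotation fields used in that cell, and `zip_and_explode` is a hypothetical name. One deliberate difference: Spark's `arrays_zip` pads shorter arrays with nulls, whereas this sketch raises on a length mismatch.

```python
def zip_and_explode(chunks, assertions):
    """Pair parallel annotation lists and return one flat row per pair."""
    if len(chunks) != len(assertions):
        raise ValueError("chunk and assertion lists must have equal length")
    rows = []
    for chunk, assertion in zip(chunks, assertions):
        rows.append({
            "chunk": chunk["result"],
            "begin": chunk["begin"],
            "end": chunk["end"],
            "ner_label": chunk["metadata"]["entity"],
            "sent_id": chunk["metadata"]["sentence"],
            "assertion": assertion["result"],
            "confidence": assertion["metadata"]["confidence"],
        })
    return rows


# Values copied from the first two rows of the output above
chunks = [
    {"result": "Discharge", "begin": 68, "end": 76,
     "metadata": {"entity": "Admission_Discharge", "sentence": "0"}},
    {"result": "pleural effusion", "begin": 132, "end": 147,
     "metadata": {"entity": "Disease_Syndrome_Disorder", "sentence": "0"}},
]
assertions = [
    {"result": "Past", "metadata": {"confidence": "1.0"}},
    {"result": "Present", "metadata": {"confidence": "0.9904"}},
]

for row in zip_and_explode(chunks, assertions):
    print(row["chunk"], row["ner_label"], row["assertion"], row["confidence"])
```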

### Pretrained `assertion_dl_radiology` model

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model for radiology
radiology_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    #.setIncludeAllConfidenceScores(False)

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["ImagingFindings"])

# Assertion model trained on radiology dataset
radiology_assertion = AssertionDLModel.pretrained("assertion_dl_radiology", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    radiology_ner,
    ner_converter,
    radiology_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
radiologyAssertion_model = nlpPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_radiology download started this may take some time.
[OK!]
assertion_dl_radiology download started this may take some time.
[OK!]


In [None]:
# A sample text from a radiology report

text = """No right-sided pleural effusion or pneumothorax is definitively seen and there are mildly displaced fractures of the left lateral 8th and likely 9th ribs."""

In [None]:
data = spark.createDataFrame([[text]]).toDF("text")

In [None]:
result = radiologyAssertion_model.transform(data)

In [None]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['sentence']").alias("sent_id"),
              F.expr("cols['2']").alias("assertion")).show(truncate=False)

+-------------------+---------------+-------+---------+
|chunk              |ner_label      |sent_id|assertion|
+-------------------+---------------+-------+---------+
|effusion           |ImagingFindings|0      |Negative |
|pneumothorax       |ImagingFindings|0      |Negative |
|displaced fractures|ImagingFindings|0      |Confirmed|
+-------------------+---------------+-------+---------+



## Writing a generic Assertion + NER function

In [None]:
def get_base_pipeline (embeddings = 'embeddings_clinical'):

    documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    # Sentence Detector annotator, processes various sentences per line
    sentenceDetector = SentenceDetector()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    # Tokenizer splits words in a relevant format for NLP
    tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    # Word embeddings selected by the `embeddings` argument (default: embeddings_clinical)
    word_embeddings = WordEmbeddingsModel.pretrained(embeddings, "en", "clinical/models")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

    base_pipeline = Pipeline(stages=[
                        documentAssembler,
                        sentenceDetector,
                        tokenizer,
                        word_embeddings])

    return base_pipeline



def get_clinical_assertion (embeddings, spark_df, nrows = 100, ner_model_name = 'ner_clinical', assertion_model_name="assertion_dl"):

    # NER model selected by ner_model_name (default: ner_clinical)
    loaded_ner_model = MedicalNerModel.pretrained(ner_model_name, "en", "clinical/models") \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("ner")

    ner_converter = NerConverterInternal() \
        .setInputCols(["sentence", "token", "ner"]) \
        .setOutputCol("ner_chunk")

    # Assertion model selected by assertion_model_name (default: assertion_dl)
    # AssertionDLModel comes from sparknlp_jsl.annotator
    clinical_assertion = AssertionDLModel.pretrained(assertion_model_name, "en", "clinical/models") \
        .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
        .setOutputCol("assertion")
      

    base_model = get_base_pipeline (embeddings)

    nlpPipeline = Pipeline(stages=[
        base_model,
        loaded_ner_model,
        ner_converter,
        clinical_assertion])

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)

    result = model.transform(spark_df.limit(nrows))

    result = result.withColumn("id", F.monotonically_increasing_id())

    result_df = result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                                     result.ner_chunk.metadata, 
                                                     result.assertion.result,
                                                     result.assertion.metadata)).alias("cols")) \
                      .select(F.expr("cols['0']").alias("chunk"),
                              F.expr("cols['1']['entity']").alias("ner_label"),
                              F.expr("cols['2']").alias("assertion"),
                              F.expr("cols['3']['confidence']").alias("confidence"))\
                      .filter("ner_label!='O'")

    return result_df

In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_clinical_large'

nrows = 100

ner_df = get_clinical_assertion (embeddings, mt_samples_df, nrows, ner_model_name)

ner_df.show(30,truncate=50)

ner_clinical_large download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+----------------------------+---------+---------+----------+
|                       chunk|ner_label|assertion|confidence|
+----------------------------+---------+---------+----------+
|                Mesothelioma|  PROBLEM|  present|    0.9996|
|                Mesothelioma|  PROBLEM|  present|    0.9996|
|            pleural effusion|  PROBLEM|  present|    0.9997|
|         atrial fibrillation|  PROBLEM|  present|    0.9998|
|                      anemia|  PROBLEM|  present|    0.9997|
|                     ascites|  PROBLEM|  present|    0.9997|
|           esophageal reflux|  PROBLEM|  present|    0.9998|
|      deep venous thrombosis|  PROBLEM|  present|    0.9998|
|                Mesothelioma|  PROBLEM|  present|    0.9992|
|            Pleural eff

In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_posology'

nrows = 100

ner_df = get_clinical_assertion (embeddings, mt_samples_df, nrows, ner_model_name)

ner_df.show()

ner_posology download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+----------------+---------+------------+----------+
|           chunk|ner_label|   assertion|confidence|
+----------------+---------+------------+----------+
|        Coumadin|     DRUG|hypothetical|    0.8709|
|            1 mg| STRENGTH| conditional|    0.7772|
|           daily|FREQUENCY| conditional|    0.5086|
|      Amiodarone|     DRUG|hypothetical|    0.8589|
|          100 mg| STRENGTH|hypothetical|    0.6143|
|             p.o|    ROUTE|hypothetical|    0.7991|
|           daily|FREQUENCY|     present|    0.9074|
|        Coumadin|     DRUG|     present|    0.9997|
|         Lovenox|     DRUG|     present|    0.9994|
|           40 mg| STRENGTH|     present|    0.9982|
|  subcutaneously|    ROUTE|     present|    0.9871|
|    chemotherapy|     DRUG|    

In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_posology_greedy'

entry_data = spark.createDataFrame([["The patient did not take a capsule of Advil."]]).toDF("text")

ner_df = get_clinical_assertion (embeddings, entry_data, nrows, ner_model_name)

ner_df.show()

ner_posology_greedy download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+----------------+---------+---------+----------+
|           chunk|ner_label|assertion|confidence|
+----------------+---------+---------+----------+
|capsule of Advil|     DRUG|   absent|    0.9855|
+----------------+---------+---------+----------+



In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_clinical'

entry_data = spark.createDataFrame([["The patient has no fever"]]).toDF("text")

ner_df = get_clinical_assertion (embeddings, entry_data, nrows, ner_model_name)

ner_df.show()

ner_clinical download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+-----+---------+---------+----------+
|chunk|ner_label|assertion|confidence|
+-----+---------+---------+----------+
|fever|  PROBLEM|   absent|     0.998|
+-----+---------+---------+----------+



In [None]:
def get_clinical_assertion_light (light_model, text):

  light_result = light_model.fullAnnotate(text)[0]

  chunks=[]
  entities=[]
  status=[]
  confidence=[]

  for n,m in zip(light_result['ner_chunk'],light_result['assertion']):
      
      chunks.append(n.result)
      entities.append(n.metadata['entity']) 
      status.append(m.result)
      confidence.append(m.metadata['confidence'])
          
  df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status,'confidence':confidence})

  return df

In [None]:
clinical_text = """
Patient with severe fever and sore throat. 
He shows no stomach pain and he maintained on an epidural and PCA for pain control.
He also became short of breath with climbing a flight of stairs.
After CT, lung tumor located at the right lower lobe. Father with Alzheimer.
"""

light_model = LightPipeline(model)

# get_clinical_assertion_light (light_model, clinical_text)

cols = [
     'entities_ner_chunk',
     'entities_ner_chunk_class', 
     'assertion',
     'assertion_confidence']
     
df = nlu.to_pretty_df(light_model,clinical_text, output_level='chunk')[cols]
df



|    | entities_ner_chunk | entities_ner_chunk_class | assertion | assertion_confidence |
|---:|:-------------------|:-------------------------|:----------|---------------------:|
| 0 | fever | VS_Finding | Present | 1.0 |
| 0 | sore throat | Symptom | Present | 1.0 |
| 0 | stomach pain | Symptom | Absent | 1.0 |
| 0 | pain | Symptom | Hypothetical | 0.9973 |
| 0 | short of breath | Symptom | Present | 1.0 |
| 0 | climbing a flight of stairs | Symptom | Present | 0.9434 |
| 0 | Alzheimer | Disease_Syndrome_Disorder | Family | 0.8136 |


# Oncological Assertion Models


|    | model_name              |Predicted Entities|
|---:|:------------------------|-|
| 1 | [assertion_oncology_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_wip_en.html) | Medical_History, Family_History, Possible, Hypothetical_Or_Absent|
| 2 | [assertion_oncology_problem_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_problem_wip_en.html) |Present, Possible, Hypothetical, Absent, Family|
| 3 | [assertion_oncology_treatment_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_treatment_binary_wip_en.html) |Present, Planned, Past, Hypothetical, Absent|
| 4 | [assertion_oncology_response_to_treatment_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_response_to_treatment_wip_en.html) |Present_Or_Past, Hypothetical_Or_Absent|
| 5 | [assertion_oncology_test_binary_wip](https://nlp.johnsnowlabs.com/2022/10/01/assertion_oncology_test_binary_wip_en.html) |Present_Or_Past, Hypothetical_Or_Absent|
| 6 | [assertion_oncology_smoking_status_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_smoking_status_wip_en.html) |Absent, Past, Present|
| 7 | [assertion_oncology_family_history_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_family_history_wip_en.html) |Family_History, Other|
| 8 | [assertion_oncology_demographic_binary_wip](https://nlp.johnsnowlabs.com/2022/10/11/assertion_oncology_demographic_binary_wip_en.html) |Patient, Someone_Else|

In [None]:
embeddings = 'embeddings_clinical'

ner_model_name = 'ner_oncology_wip'

assertion_model_name='assertion_oncology_wip'

nrows = 100

ner_df = get_clinical_assertion (embeddings, mt_samples_df, nrows, ner_model_name,assertion_model_name )

ner_df.show(truncate = False)

ner_oncology_wip download started this may take some time.
[OK!]
assertion_oncology_wip download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+------------------------+--------------------+------------+----------+
|chunk                   |ner_label           |assertion   |confidence|
+------------------------+--------------------+------------+----------+
|Mesothelioma            |Cancer_Dx           |Present     |0.9885    |
|Mesothelioma            |Cancer_Dx           |Hypothetical|0.981     |
|August 24, 2007         |Date                |Past        |0.9726    |
|decortication           |Cancer_Surgery      |Past        |0.994     |
|lung                    |Site_Lung           |Past        |0.9453    |
|pleural                 |Site_Other_Body_Part|Past        |0.9624    |
|biopsy                  |Pathology_Test      |Past        |0.9979    |
|transpleural fluoroscopy|Imaging_Test  

# Assertion Filterer
`AssertionFilterer` lets you filter named entities by a list of acceptable assertion statuses. It is handy when you want to whitelist statuses such as `present` or `conditional` and keep chunks with other statuses, such as `absent`, from leaving your pipeline.
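To make the whitelist idea concrete, here is a minimal plain-Python sketch of the filtering logic. It is not the annotator's implementation: `filter_by_assertion` and the sample pairs are hypothetical names and data used only for illustration, and the case handling is an assumption about what a case-insensitive whitelist means.

```python
def filter_by_assertion(pairs, white_list, case_sensitive=False):
    """Keep only (chunk, status) pairs whose status appears in white_list."""
    if case_sensitive:
        allowed = set(white_list)
        return [(chunk, status) for chunk, status in pairs if status in allowed]
    # Case-insensitive matching: compare lowercased statuses
    allowed = {status.lower() for status in white_list}
    return [(chunk, status) for chunk, status in pairs if status.lower() in allowed]


# Hypothetical (chunk, assertion status) pairs, for illustration only
pairs = [
    ("a headache", "Present"),
    ("a head CT", "Planned"),
    ("anxious", "Possible"),
    ("Alopecia", "Present"),
    ("pain", "Absent"),
]

kept = filter_by_assertion(pairs, ["PREsent"])
print(kept)
```

With case-insensitive matching, an oddly cased whitelist entry such as `"PREsent"` still matches the `Present` status; with `case_sensitive=True` the same whitelist would match nothing in this sample.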

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\
    #.setIncludeAllConfidenceScores(False)

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["PREsent"])

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
assertionFilter_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
assertion_jsl_augmented download started this may take some time.
[OK!]


In [None]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain.'

light_model = LightPipeline(assertionFilter_model)
light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['assertion_filtered', 'document', 'ner_chunk', 'assertion', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
list(zip(light_result['ner_chunk'], light_result['assertion']))

[('a headache', 'Present'),
 ('a head CT', 'Planned'),
 ('anxious', 'Possible'),
 ('Alopecia', 'Present'),
 ('pain', 'Absent')]

In [None]:
assertion_filterer.getWhiteList()

['PREsent']

In [None]:
chunks=[]
entities=[]
status=[]
confidence=[]

light_result = light_model.fullAnnotate(text)[0]

for m in light_result['assertion_filtered']:
    
    chunks.append(m.result)
    entities.append(m.metadata['entity']) 
    status.append(m.metadata['assertion'])
    confidence.append(m.metadata['confidence'])
        
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

Unnamed: 0,chunks,entities,assertion,confidence
0,a headache,PROBLEM,Present,0.97150004
1,Alopecia,PROBLEM,Present,0.9949


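As you can see, only the chunks whose assertion status is "Present" pass the filter. The "pain" chunk is dropped because its assertion label is "Absent", and "a head CT" (Planned) and "anxious" (Possible) are dropped for the same reason. The whitelist entry `PREsent` still matches because `setCaseSensitive(False)` is set.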

# AssertionChunkConverter

Creating the chunk column from token indices can go wrong: if the token indices are off, some data is lost while training and testing the assertion status model. The `AssertionChunkConverter` annotator addresses this by taking the **begin and end indices of the chunks** as input and creating an extended chunk column with metadata that can be used for assertion status detection model training.

*NOTE*: Chunk begin and end indices in the assertion status model training dataframe can be populated using the new version of the ALAB module.
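The conversion from character offsets to token indices can be sketched in plain Python. This is only an illustration of the idea, under the assumption that tokens are whitespace-separated (the real tokenizer also splits punctuation) and that the end offset is exclusive; `char_span_to_token_span` is a hypothetical helper, not part of the library.

```python
import re


def char_span_to_token_span(text, char_begin, char_end):
    """Map a [char_begin, char_end) character span to (token_begin, token_end).

    Simplification: tokens are runs of non-whitespace characters.
    Returns (None, None) if no token overlaps the span.
    """
    token_begin = token_end = None
    for idx, match in enumerate(re.finditer(r"\S+", text)):
        # A token belongs to the chunk if its character range overlaps the span
        if match.start() < char_end and match.end() > char_begin:
            if token_begin is None:
                token_begin = idx
            token_end = idx
    return token_begin, token_end


text = ("An angiography showed bleeding in two vessels off of the Minnie "
        "supplying the sigmoid that were succesfully embolized.")

print(char_span_to_token_span(text, 57, 63))  # (10, 10)
```

Working from character offsets avoids depending on a particular tokenization when the training data is annotated, which is the motivation given above for the converter.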

In [None]:
data = spark.createDataFrame([["An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.", "Minnie", 57, 63],
     ["After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation ", "PCP", 31, 34]])\
     .toDF("text", "target", "char_begin", "char_end")

data.show()

+--------------------+------+----------+--------+
|                text|target|char_begin|char_end|
+--------------------+------+----------+--------+
|An angiography sh...|Minnie|        57|      63|
|After discussing ...|   PCP|        31|      34|
+--------------------+------+----------+--------+



In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("tokens")

converter = AssertionChunkConverter() \
    .setInputCols("tokens")\
    .setChunkTextCol("target")\
    .setChunkBeginCol("char_begin")\
    .setChunkEndCol("char_end")\
    .setOutputTokenBeginCol("token_begin")\
    .setOutputTokenEndCol("token_end")\
    .setOutputCol("chunk")

pipeline = Pipeline().setStages([document_assembler,sentenceDetector, tokenizer, converter])

results = pipeline.fit(data).transform(data)

In [None]:
results\
    .selectExpr(
        "target",
        "char_begin",
        "char_end",
        "token_begin",
        "token_end",
        "tokens[token_begin].result",
        "tokens[token_end].result",
        "target",
        "chunk")\
    .show(truncate=False)

+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk                                         |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57        |63      |10         |10       |Minnie                    |Minnie                  |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP   |31        |34      |5          |5        |PCP                       |PCP                     |PCP   |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]   |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+



# Train a custom Assertion Model

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/i2b2_assertion_sample_short.csv

In [None]:
import pandas as pd

In [None]:
assertion_df = spark.read.option("header", True).option("inferSchema", "True").csv("i2b2_assertion_sample_short.csv")

In [None]:
assertion_df.show(3, truncate=100)

+-------------------------------------------------+-------------------+-------+-----+---+
|                                             text|             target|  label|start|end|
+-------------------------------------------------+-------------------+-------+-----+---+
|She has no history of liver disease , hepatitis .|      liver disease| absent|    5|  6|
|                         1. Undesired fertility .|undesired fertility|present|    1|  2|
|                            3) STATUS POST FALL .|               fall|present|    3|  3|
+-------------------------------------------------+-------------------+-------+-----+---+
only showing top 3 rows



In [None]:
(training_data, test_data) = assertion_df.randomSplit([0.8, 0.2], seed = 100)
print("Training Dataset Count: " + str(training_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 721
Test Dataset Count: 170


In [None]:
training_data.groupBy('label').count().orderBy('count', ascending=False).show(truncate=False)

+-------+-----+
|label  |count|
+-------+-----+
|present|546  |
|absent |175  |
+-------+-----+



In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunk = Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("chunk")\
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(True)

token = Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


We transform the test data with a pipeline made of the same preprocessing stages as the training pipeline that contains `AssertionDLApproach`.
This gives the test data the same columns as the training data that `AssertionDLApproach` sees. <br/>
The goal is to enable the `setTestDataset()` parameter of `AssertionDLApproach`.
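A quick sanity check before saving the test data is to confirm that the expected columns are present. The sketch below is a hypothetical helper, not a library function; the set of required columns is an assumption based on how the approach is configured in this notebook (input columns `document`, `chunk`, `embeddings`, plus `label`, `start` and `end`).

```python
# Assumed requirement: the columns AssertionDLApproach is configured to read in this notebook
REQUIRED_COLUMNS = {"document", "chunk", "embeddings", "label", "start", "end"}


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`, sorted."""
    return sorted(set(required) - set(columns))


raw_columns = ["text", "target", "label", "start", "end"]
transformed_columns = raw_columns + ["document", "chunk", "token", "embeddings"]

print(missing_columns(raw_columns))          # ['chunk', 'document', 'embeddings']
print(missing_columns(transformed_columns))  # []
```

On a Spark dataframe the same check could be run as `missing_columns(df.columns)`.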

In [None]:
clinical_assertion_pipeline = Pipeline(
    stages = [
    document,
    chunk,
    token,
    embeddings])

assertion_test_data = clinical_assertion_pipeline.fit(test_data).transform(test_data)

In [None]:
assertion_test_data.columns

['text',
 'target',
 'label',
 'start',
 'end',
 'document',
 'chunk',
 'token',
 'embeddings']

We save the test data in parquet format to use in `AssertionDLApproach()`. 

In [None]:
# mode("overwrite") lets this cell be re-run without failing on an existing path
assertion_test_data.write.mode("overwrite").parquet('i2b2_assertion_sample_test_data.parquet')

## Graph setup

In [None]:
!pip install -q tensorflow==2.11.0
!pip install -q tensorflow-addons

We will use the `TFGraphBuilder` annotator, which creates the TensorFlow graph as part of the model training pipeline.

`TFGraphBuilder` inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder= "./tf_graphs"

assertion_graph_builder = TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(250)\
    .setHiddenUnitsNumber(25)

In [None]:
'''
# ready to use tf_graph

!mkdir training_logs
!mkdir assertion_tf_graph

!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/tf_graphs/blstm_34_32_30_200_2.pb -P /content/assertion_tf_graph
'''

In [None]:
'''
# create custom graph

from sparknlp_jsl.training import tf_graph
tf_graph.print_model_params("assertion_dl")

feat_size = 200
n_classes = 6

tf_graph.build("assertion_dl",
              build_params={"n_classes": n_classes},
              model_location= "./tf_graphs", 
              model_filename="blstm_34_32_30_{}_{}.pb".format(feat_size, n_classes))
'''

**Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models**


The `setScopeWindow()` parameter lets you train Assertion Status models to focus on a specific context window when resolving the status of a NER chunk. The window has the format `[X,Y]`, where `X` is the maximum number of tokens to consider to the left of the chunk and `Y` is the maximum number of tokens to consider to the right. Let’s take a look at what different windows mean:


*   By default, the window is `[-1,-1]`, which means that the Assertion Status model looks at all of the tokens in the sentence/document (up to the maximum number of tokens set in `setMaxSentLen()`).
*   `[0,0]` means “don’t pay attention to any token except the ner_chunk”, so no context is considered when resolving the assertion.
*   `[9,15]` is what empirically seems to be the best baseline: the model looks at up to 9 tokens on the left and 15 on the right of the NER chunk to understand the context and resolve the status.
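The windows described above can be illustrated with a small plain-Python sketch that selects the tokens a given window would expose. It only illustrates the `[X,Y]` semantics: `tokens_in_scope` is a hypothetical helper, how the model applies the window internally is not shown here, and treating `-1` independently on each side is an assumption.

```python
def tokens_in_scope(tokens, chunk_start, chunk_end, scope_window=(-1, -1)):
    """Return the tokens visible for a chunk at tokens[chunk_start..chunk_end] (inclusive).

    scope_window is (left, right); -1 on a side means "no limit" on that side.
    """
    left, right = scope_window
    lo = 0 if left < 0 else max(0, chunk_start - left)
    hi = len(tokens) if right < 0 else min(len(tokens), chunk_end + 1 + right)
    return tokens[lo:hi]


# Sample sentence from the training data; "liver disease" covers tokens 5..6
tokens = "She has no history of liver disease , hepatitis .".split()

print(tokens_in_scope(tokens, 5, 6))          # all 10 tokens
print(tokens_in_scope(tokens, 5, 6, (0, 0)))  # ['liver', 'disease']
print(tokens_in_scope(tokens, 5, 6, (3, 1)))  # ['no', 'history', 'of', 'liver', 'disease', ',']
```

Note how the negation cue "no" is only visible when the left window is wide enough, which is why a `[0,0]` window gives the model no context to work with.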


See the [Scope Window Tuning Assertion Status Detection notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.1.Scope_window_tuning_assertion_status_detection.ipynb), which illustrates the effect of different windows and how to fine-tune your AssertionDLModels to get the best out of them.

In our case, the best scope window is around `[10,10]`.

In [None]:
scope_window = [10,10]

assertionStatus = AssertionDLApproach()\
    .setLabelCol("label")\
    .setInputCols("document", "chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setBatchSize(64)\
    .setDropout(0.1)\
    .setLearningRate(0.001)\
    .setEpochs(20)\
    .setValidationSplit(0.2)\
    .setStartCol("start")\
    .setEndCol("end")\
    .setMaxSentLen(250)\
    .setIncludeConfidence(True)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="/content/i2b2_assertion_sample_test_data.parquet")\
    .setScopeWindow(scope_window)

# If the .setTestDataset parameter is used, raw test data cannot be passed to it. .setTestDataset only works with
# dataframes that have already been transformed by a pipeline consisting of document, chunk and embeddings stages.

In [None]:
'''
assertionStatus = AssertionLogRegApproach()\
    .setLabelCol("label")\
    .setInputCols("document", "chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setMaxIter(100) # default: 26
'''

In [None]:
clinical_assertion_pipeline = Pipeline(
    stages = [
    document,
    chunk,
    token,
    embeddings,
    assertion_graph_builder,
    assertionStatus])

In [None]:
%%time

assertion_model = clinical_assertion_pipeline.fit(training_data)

TF Graph Builder configuration:
Model name: assertion_dl
Graph folder: ./tf_graphs
Graph file name: assertion_graph.pb
Build params: {'n_classes': 2, 'feat_size': 200, 'max_seq_len': 250, 'n_hidden': 25}


Instructions for updating:
non-resource variables are not supported in the long term


Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5



Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5

assertion_dl graph exported to ./tf_graphs/assertion_graph.pb
CPU times: user 12.5 s, sys: 1.23 s, total: 13.7 s
Wall time: 2min 49s


## Checking the results

Checking the results saved in the log file

In [None]:
import os

log_files = os.listdir("./training_logs")
log_files

['AssertionDLApproach_dbc279e69879.log']

In [None]:
with open("./training_logs/"+log_files[0]) as log_file:
    print(log_file.read())

Name of the selected graph: ./tf_graphs/assertion_graph.pb
Training started, trainExamples: 721


Epoch: 0 started, learning rate: 0.001, dataset size: 577
Done, 5.639568454 total training loss: 9.557581, avg training loss: 0.9557581, batches: 10
Quality on validation dataset (20.0%), validation examples = 144
time to finish evaluation: 0.86s
Total validation loss: 1.8580	Avg validation loss: 0.6193
label	 tp	 fp	 fn	 prec	 rec	 f1
present	 91	 23	 20	 0.7982456	 0.8198198	 0.8088889
absent	 10	 20	 23	 0.33333334	 0.3030303	 0.31746033
tp: 101 fp: 43 fn: 43 labels: 2
Macro-average	 prec: 0.56578946, rec: 0.5614251, f1: 0.5635989
Micro-average	 prec: 0.7013889, rec: 0.7013889, f1: 0.7013889


Quality on test dataset: 
time to finish evaluation: 0.83s
Total test loss: 1.9553	Avg test loss: 0.6518
label	 tp	 fp	 fn	 prec	 rec	 f1
present	 91	 39	 26	 0.7	 0.7777778	 0.73684216
absent	 14	 26	 39	 0.35	 0.26415095	 0.30107528
tp: 105 fp: 65 fn: 65 labels: 2
Macro-average	 prec: 0.525, rec
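The per-label and averaged scores in the log can be reproduced from the `tp`, `fp` and `fn` counts. The sketch below uses the epoch 0 validation counts copied from the log above. The macro F1 formula (harmonic mean of macro precision and macro recall) is an assumption inferred from the logged numbers, not something taken from the library's source.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (tp, fp, fn) per label, epoch 0 validation block of the log
counts = {"present": (91, 23, 20), "absent": (10, 20, 23)}
per_label = {label: prf(*c) for label, c in counts.items()}

# Macro average: unweighted mean over labels
macro_p = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r, _ in per_label.values()) / len(per_label)
# Assumption: macro F1 is the harmonic mean of macro precision and macro recall
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

# Micro average: pool the counts over labels first
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_p, micro_r, micro_f1 = prf(tp, fp, fn)

for label, (p, r, f1) in per_label.items():
    print(f"{label}\tprec: {p:.4f}, rec: {r:.4f}, f1: {f1:.4f}")
print(f"Macro-average\tprec: {macro_p:.4f}, rec: {macro_r:.4f}, f1: {macro_f1:.4f}")
print(f"Micro-average\tprec: {micro_p:.4f}, rec: {micro_r:.4f}, f1: {micro_f1:.4f}")
```

Because every example receives exactly one predicted label, the pooled `fp` and `fn` counts are equal, so micro precision, recall and F1 coincide, as they do in the log.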

In [None]:
preds = assertion_model.transform(test_data).select('label','assertion.result')

preds.show()

+-------+---------+
|  label|   result|
+-------+---------+
|present|[present]|
| absent|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present| [absent]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
+-------+---------+
only showing top 20 rows



In [None]:
preds_df = preds.toPandas()

In [None]:
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
preds_df

Unnamed: 0,label,result
0,present,present
1,absent,present
2,present,present
3,present,present
4,present,present
...,...,...
165,present,present
166,absent,absent
167,absent,absent
168,absent,absent


In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print (classification_report( preds_df['label'], preds_df['result']))

              precision    recall  f1-score   support

      absent       0.77      0.77      0.77        53
     present       0.90      0.90      0.90       117

    accuracy                           0.86       170
   macro avg       0.84      0.84      0.84       170
weighted avg       0.86      0.86      0.86       170



In [None]:
# save model
assertion_model.stages[-1].write().overwrite().save('assertion_custom_model')

## Load saved model

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [None]:
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

clinical_assertion = AssertionDLModel.load("assertion_custom_model") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)


ner_clinical download started this may take some time.
[OK!]


In [None]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted'


light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

print(text)

chunks=[]
entities=[]
status=[]
confidence=[]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):
    
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)
    confidence.append(m.metadata['confidence'])
        
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted


Unnamed: 0,chunks,entities,assertion,confidence
0,a headache,PROBLEM,present,0.9699
1,a head CT,TEST,present,0.9943
2,anxious,PROBLEM,present,0.7399
3,alopecia,PROBLEM,absent,0.7673
4,pain,PROBLEM,absent,0.7809
