![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb)

# Clinical Assertion Status Model 


The deep neural network architecture for assertion status detection in Spark NLP is based on a Bi-LSTM framework and is a modified version of the architecture proposed by Federico Fancellu, Adam Lopez and Bonnie Webber ([Neural Networks For Negation Scope Detection](https://aclanthology.org/P16-1047.pdf)). Its goal is to classify the assertions made on given medical concepts as being present, absent, or possible in the patient, conditionally present in the patient under certain circumstances, hypothetically present in the patient at some future point, or mentioned in the patient report but associated with someone else.

In the proposed implementation, input units depend on the target tokens (a named entity) and the neighboring words, which are explicitly encoded as a sequence using word embeddings. As in the paper mentioned above, it is observed that 95% of the scope tokens (neighboring words) fall in a window of 9 tokens to the left and 15 to the right of the target tokens in the same dataset. Therefore, the same window size was implemented, and the following parameters were used: learning rate 0.0012, dropout 0.05, batch size 64 and a maximum sentence length of 250.

The model has been implemented within Spark NLP as an annotator called AssertionDLModel. After training for 20 epochs and measuring accuracy on the official test set, this implementation matches or exceeds the latest state-of-the-art benchmarks on most labels and on micro F1, as summarized in the following table:

|Assertion Label|Spark NLP|Latest Best|
|-|-|-|
|Absent       |0.944 |0.937|
|Someone-else |0.904|0.869|
|Conditional  |0.441|0.422|
|Hypothetical |0.862|0.890|
|Possible     |0.680|0.630|
|Present      |0.953|0.957|
|micro F1     |0.939|0.934|
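
The scope window described above can be illustrated with a small, self-contained sketch. It is not the internal implementation of `AssertionDLModel`; the helper name `scope_window` and the whitespace tokenization are assumptions made for illustration. Given the inclusive token indices of a target entity, it keeps at most 9 tokens to the left and 15 to the right.

```python
from typing import List


def scope_window(tokens: List[str], start: int, end: int,
                 left: int = 9, right: int = 15) -> List[str]:
    """Return the target tokens plus up to `left` tokens before them and
    up to `right` tokens after them. `start` and `end` are inclusive."""
    lo = max(0, start - left)                 # clamp at the sentence start
    hi = min(len(tokens), end + 1 + right)    # clamp at the sentence end
    return tokens[lo:hi]


# Hypothetical usage on a pre-tokenized sentence; the target is "liver disease".
tokens = "She has no history of liver disease , hepatitis .".split()

# With the default 9/15 window, this short sentence fits entirely.
full_window = scope_window(tokens, start=5, end=6)

# A narrower window makes the clipping visible.
small_window = scope_window(tokens, start=5, end=6, left=2, right=1)
print(small_window)  # ['history', 'of', 'liver', 'disease', ',']
```

Short clinical sentences usually fit entirely inside the 9/15 window, so the clipping only matters for long sentences.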


**Colab Setup**

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
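
The install and session cells that follow read `SECRET`, `PUBLIC_VERSION` and `JSL_VERSION` from this file, so a quick check before continuing can save a failed install. A minimal sketch, assuming those three keys are the ones needed here (the helper name `missing_license_keys` and the placeholder values are hypothetical):

```python
import json

# Keys referenced by the install and session cells in this notebook.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(license_keys: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not license_keys.get(key)]


# Placeholder content standing in for spark_jsl.json (not a real license).
example_keys = json.loads('{"SECRET": "xxx", "PUBLIC_VERSION": "4.0.0"}')
missing = missing_license_keys(example_keys)
print(missing)  # ['JSL_VERSION']
```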

In [None]:
# Installing pyspark, nlu and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 nlu==4.0.1rc2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import json
import os

from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

import nlu
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.0.0
Spark NLP_JSL Version : 4.0.0


In [None]:
# if you want to start the session with custom params as in start function above
from pyspark.sql import SparkSession

def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)

# Clinical Assertion Models (with pretrained models)

|    | model_name              |Predicted Entities|
|---:|:------------------------|-|
|  1 | [assertion_dl](https://nlp.johnsnowlabs.com/2021/01/26/assertion_dl_en.html)            |Present, Absent, Possible, Planned, Someoneelse, Past, Family, None, Hypotetical|
|  2 | [assertion_dl_biobert](https://nlp.johnsnowlabs.com/2021/01/26/assertion_dl_biobert_en.html)    |absent, present, conditional, associated_with_someone_else, hypothetical, possible|
|  3 | [assertion_dl_healthcare](https://nlp.johnsnowlabs.com/2020/09/23/assertion_dl_healthcare_en.html) |absent, present, conditional, associated_with_someone_else, hypothetical, possible|
|  4 | [assertion_dl_large](https://nlp.johnsnowlabs.com/2020/05/21/assertion_dl_large_en.html)      |hypothetical, present, absent, possible, conditional, associated_with_someone_else|
|  5 | [assertion_dl_radiology](https://nlp.johnsnowlabs.com/2021/03/18/assertion_dl_radiology_en.html)   |Confirmed, Suspected, Negative|
|  6 | [assertion_jsl](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_en.html)           |Present, Absent, Possible, Planned, Someoneelse, Past, Family, None, Hypotetical|
|  7 | [assertion_jsl_large](https://nlp.johnsnowlabs.com/2021/07/24/assertion_jsl_large_en.html)     |present, absent, possible, planned, someoneelse, past|
|  8 |  [assertion_ml](https://nlp.johnsnowlabs.com/2020/01/30/assertion_ml_en.html) |Hypothetical, Present, Absent, Possible, Conditional, Associated_with_someone_else|
|  9 | [assertion_dl_scope_L10R10](https://nlp.johnsnowlabs.com/2022/03/17/assertion_dl_scope_L10R10_en_3_0.html)| hypothetical, associated_with_someone_else, conditional, possible, absent, present|
| 10 | [assertion_dl_biobert_scope_L10R10](https://nlp.johnsnowlabs.com/2022/03/24/assertion_dl_biobert_scope_L10R10_en_2_4.html)| hypothetical, associated_with_someone_else, conditional, possible, absent, present|

### Pretrained `assertion_dl` model

In [4]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

from sparknlp_jsl.annotator import *

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\
    #.setIncludeAllConfidenceScores(False)

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

# Assertion model trained on i2b2 (sampled from MIMIC) dataset
# coming from sparknlp_jsl.annotator !!
clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    clinical_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]


In [5]:
AssertionDLApproach().extractParamMap()

{Param(parent='AssertionDLApproach_965a126a8f72', name='batchSize', doc='Size for each batch in the optimization process'): 64,
 Param(parent='AssertionDLApproach_965a126a8f72', name='dropout', doc='Dropout at the output of each layer'): 0.05,
 Param(parent='AssertionDLApproach_965a126a8f72', name='epochs', doc='Number of epochs for the optimization process'): 5,
 Param(parent='AssertionDLApproach_965a126a8f72', name='includeConfidence', doc='whether to include confidence scores in annotation metadata'): False,
 Param(parent='AssertionDLApproach_965a126a8f72', name='label', doc='Column with one label per document'): 'label',
 Param(parent='AssertionDLApproach_965a126a8f72', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='AssertionDLApproach_965a126a8f72', name='learningRate', doc='Learning rate for the optimization process'): 0.0012,
 Param(parent='AssertionDLApproach_965a126a8f72', name='maxSentLen', doc='Max length fo

In [None]:
# we also have a LogReg based Assertion Model.
'''
clinical_assertion_ml = AssertionLogRegModel.pretrained("assertion_ml", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
'''

In [6]:
import pandas as pd

text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia noted. She denies pain'

print (text)

light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

chunks=[]
entities=[]
status=[]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):
    
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)
        
df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})

df

Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia noted. She denies pain


| |chunks|entities|assertion|
|-|-|-|-|
|0|a headache|PROBLEM|present|
|1|a head CT|TEST|present|
|2|anxious|PROBLEM|present|
|3|alopecia|PROBLEM|absent|
|4|pain|PROBLEM|absent|
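
Once chunks and assertion statuses are paired, it is often convenient to group the chunks by status. The sketch below does this in plain Python on the rows shown above; the variable names are illustrative and nothing here depends on Spark NLP.

```python
from collections import defaultdict

# (chunk, entity, assertion) rows copied from the result above.
rows = [
    ("a headache", "PROBLEM", "present"),
    ("a head CT", "TEST", "present"),
    ("anxious", "PROBLEM", "present"),
    ("alopecia", "PROBLEM", "absent"),
    ("pain", "PROBLEM", "absent"),
]

# Collect chunks under their assertion status, preserving order.
by_status = defaultdict(list)
for chunk, entity, assertion in rows:
    by_status[assertion].append(chunk)

print(by_status["absent"])  # ['alopecia', 'pain']
```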


In [7]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia noted. She denies pain'

nlu.to_pretty_df(model,text,output_level='chunk').columns

Index(['index', 'assertion', 'assertion_confidence', 'document',
       'entities_ner_chunk', 'entities_ner_chunk_class',
       'entities_ner_chunk_confidence', 'entities_ner_chunk_origin_chunk',
       'entities_ner_chunk_origin_sentence', 'sentence_pragmatic',
       'word_embedding_embeddings'],
      dtype='object')

In [8]:
cols = [
     'entities_ner_chunk',
     'entities_ner_chunk_class', 
     'assertion',]
     
df = nlu.to_pretty_df(model,text,output_level='chunk')[cols]
df


| |entities_ner_chunk|entities_ner_chunk_class|assertion|
|-|-|-|-|
|0|a headache|PROBLEM|present|
|1|a head CT|TEST|present|
|2|anxious|PROBLEM|present|
|3|alopecia|PROBLEM|absent|
|4|pain|PROBLEM|absent|


In [9]:
light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

from sparknlp_display import AssertionVisualizer

vis = AssertionVisualizer()

vis.set_label_colors({'TEST':'#008080', 'PROBLEM':'#800080'})

vis.display(light_result, 'ner_chunk', 'assertion')

In [10]:
! wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed_sample_text_small.csv

In [11]:
import pyspark.sql.functions as F

pubMedDF = spark.read\
                .option("header", "true")\
                .csv("pubmed_sample_text_small.csv")

pubMedDF.show(truncate=50)

+--------------------------------------------------+
|                                              text|
+--------------------------------------------------+
|The human KCNJ9 (Kir 3.3, GIRK3) is a member of...|
|BACKGROUND: At present, it is one of the most i...|
|OBJECTIVE: To investigate the relationship betw...|
|Combined EEG/fMRI recording has been used to lo...|
|Kohlschutter syndrome is a rare neurodegenerati...|
|Statistical analysis of neuroimages is commonly...|
|The synthetic DOX-LNA conjugate was characteriz...|
|Our objective was to compare three different me...|
|We conducted a phase II study to assess the eff...|
|"""Monomeric sarcosine oxidase (MSOX) is a flav...|
|We presented the tachinid fly Exorista japonica...|
|The literature dealing with the water conductin...|
|A novel approach to synthesize chitosan-O-isopr...|
|An HPLC-ESI-MS-MS method has been developed for...|
|The localizing and lateralizing values of eye a...|
|OBJECTIVE: To evaluate the effectiveness and 

In [12]:
result = model.transform(pubMedDF.limit(100))

In [13]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|           assertion|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The human KCNJ9 (...|[{document, 0, 95...|[{document, 0, 12...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 48, 106,...|[{assertion, 48, ...|
|BACKGROUND: At pr...|[{document, 0, 14...|[{document, 0, 19...|[{token, 0, 9, BA...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 67, 79, ...|[{assertion, 67, ...|
|OBJECTIVE: To inv...|[{document, 0, 15...|[{document, 0, 30...|[{token, 0, 8, OB...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 188, 231...|[{

In [14]:
result.select('sentence.result').take(1)

[Row(result=['The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family.', 'Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population.', 'The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively.', 'We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair', '(bp) insertion/deletion.', 'Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle.', 'The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.'])]

In [15]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,  
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['3']['sentence']").alias("sent_id"),
              F.expr("cols['4']").alias("assertion") ).show(truncate=False)


+-----------------------------------------------------------+-----+---+---------+-------+-----------+
|chunk                                                      |begin|end|ner_label|sent_id|assertion  |
+-----------------------------------------------------------+-----+---+---------+-------+-----------+
|the G-protein-activated inwardly rectifying potassium (GIRK|48   |106|TREATMENT|0      |conditional|
|the genomicorganization                                    |142  |164|TREATMENT|1      |present    |
|a candidate gene forType II diabetes mellitus              |210  |254|PROBLEM  |1      |present    |
|byapproximately                                            |380  |394|TREATMENT|2      |present    |
|single nucleotide polymorphisms                            |464  |494|TREATMENT|3      |present    |
|aVal366Ala substitution                                    |532  |554|PROBLEM  |3      |present    |
|an 8 base-pair                                             |561  |574|PROBLEM  |3
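
The `arrays_zip` + `explode` pattern above pairs the i-th element of each array column and then emits one row per position. A plain-Python analogue, using the first three chunks of the output above and a simplified metadata dict (real annotation metadata carries more keys than the two shown), might look like this:

```python
# Parallel arrays, as they sit inside a single DataFrame row.
chunks = [
    "the G-protein-activated inwardly rectifying potassium (GIRK",
    "the genomicorganization",
    "a candidate gene forType II diabetes mellitus",
]
begins = [48, 142, 210]
ends = [106, 164, 254]
metadata = [
    {"entity": "TREATMENT", "sentence": "0"},
    {"entity": "TREATMENT", "sentence": "1"},
    {"entity": "PROBLEM", "sentence": "1"},
]
assertions = ["conditional", "present", "present"]

# zip() plays the role of arrays_zip; the comprehension plays the role of explode.
rows = [
    {
        "chunk": chunk,
        "begin": begin,
        "end": end,
        "ner_label": meta["entity"],
        "sent_id": meta["sentence"],
        "assertion": assertion,
    }
    for chunk, begin, end, meta, assertion
    in zip(chunks, begins, ends, metadata, assertions)
]

for row in rows:
    print(row["chunk"], row["ner_label"], row["assertion"])
```

The pairing is purely positional, which is why the chunk and assertion arrays must have the same length and order.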

### Pretrained `assertion_dl_radiology` model

In [16]:
from sparknlp_jsl.annotator import *

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model for radiology
radiology_ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\
    #.setIncludeAllConfidenceScores(False)

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["ImagingFindings"])

# Assertion model trained on radiology dataset
# coming from sparknlp_jsl.annotator !!

radiology_assertion = AssertionDLModel.pretrained("assertion_dl_radiology", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    radiology_ner,
    ner_converter,
    radiology_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
radiologyAssertion_model = nlpPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_radiology download started this may take some time.
[OK!]
assertion_dl_radiology download started this may take some time.
[OK!]


In [17]:
# A sample text from a radiology report

text = """No right-sided pleural effusion or pneumothorax is definitively seen and there are mildly displaced fractures of the left lateral 8th and likely 9th ribs."""

In [18]:
data = spark.createDataFrame([[text]]).toDF("text")

In [19]:
result = radiologyAssertion_model.transform(data)

In [20]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['sentence']").alias("sent_id"),
              F.expr("cols['2']").alias("assertion")).show(truncate=False)

+-------------------+---------------+-------+---------+
|chunk              |ner_label      |sent_id|assertion|
+-------------------+---------------+-------+---------+
|effusion           |ImagingFindings|0      |Negative |
|pneumothorax       |ImagingFindings|0      |Negative |
|displaced fractures|ImagingFindings|0      |Confirmed|
+-------------------+---------------+-------+---------+



## Writing a generic Assertion + NER function

In [21]:
from pyspark.sql.functions import monotonically_increasing_id


def get_base_pipeline (embeddings = 'embeddings_clinical'):

    documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    # Sentence Detector annotator, processes various sentences per line
    sentenceDetector = SentenceDetector()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    # Tokenizer splits words in a relevant format for NLP
    tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    # Clinical word embeddings trained on PubMed dataset
    word_embeddings = WordEmbeddingsModel.pretrained(embeddings, "en", "clinical/models")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

    base_pipeline = Pipeline(stages=[
                        documentAssembler,
                        sentenceDetector,
                        tokenizer,
                        word_embeddings])

    return base_pipeline



def get_clinical_assertion (embeddings, spark_df, nrows = 100, model_name = 'ner_clinical'):

    # NER model trained on i2b2 (sampled from MIMIC) dataset
    loaded_ner_model = MedicalNerModel.pretrained(model_name, "en", "clinical/models") \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("ner")

    ner_converter = NerConverter() \
        .setInputCols(["sentence", "token", "ner"]) \
        .setOutputCol("ner_chunk")

    # Assertion model trained on i2b2 (sampled from MIMIC) dataset
    # coming from sparknlp_jsl.annotator !!
    clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
        .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
        .setOutputCol("assertion")
      

    base_model = get_base_pipeline (embeddings)

    nlpPipeline = Pipeline(stages=[
        base_model,
        loaded_ner_model,
        ner_converter,
        clinical_assertion])

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)

    result = model.transform(spark_df.limit(nrows))

    result = result.withColumn("id", monotonically_increasing_id())

    result_df = result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                                     result.ner_chunk.metadata, 
                                                     result.assertion.result)).alias("cols")) \
                      .select(F.expr("cols['0']").alias("chunk"),
                              F.expr("cols['1']['entity']").alias("ner_label"),
                              F.expr("cols['2']").alias("assertion"))\
                      .filter("ner_label!='O'")

    return result_df

In [22]:
embeddings = 'embeddings_clinical'

model_name = 'ner_clinical_large'

nrows = 100

ner_df = get_clinical_assertion (embeddings, pubMedDF, nrows, model_name)

ner_df.show(30,truncate=50)

ner_clinical_large download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+--------------------------------------------------+---------+-----------+
|                                             chunk|ner_label|  assertion|
+--------------------------------------------------+---------+-----------+
|the G-protein-activated inwardly rectifying pot...|TREATMENT|conditional|
|                           the genomicorganization|TREATMENT|    present|
|     a candidate gene forType II diabetes mellitus|  PROBLEM|    present|
|                                   byapproximately|TREATMENT|    present|
|                   single nucleotide polymorphisms|TREATMENT|    present|
|                           aVal366Ala substitution|  PROBLEM|    present|
|                                    an 8 base-pair|  PROBLEM|    present|
|                 

In [23]:
embeddings = 'embeddings_clinical'

model_name = 'ner_posology'

nrows = 100

ner_df = get_clinical_assertion (embeddings, pubMedDF, nrows, model_name)

ner_df.show()

ner_posology download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+----------------+---------+---------+
|           chunk|ner_label|assertion|
+----------------+---------+---------+
|  anthracyclines|     DRUG|  present|
|         taxanes|     DRUG|  present|
|     vinorelbine|     DRUG|  present|
|     vinorelbine|     DRUG|  present|
|  anthracyclines|     DRUG|  present|
|         taxanes|     DRUG|  present|
|  Vinorelbinewas|     DRUG|   absent|
|       25 mg/m(2| STRENGTH|  present|
|   intravenously|    ROUTE|   absent|
|         on days|FREQUENCY|  present|
| thatvinorelbine|     DRUG|  present|
|  anthracyclines|     DRUG|  present|
|         taxanes|     DRUG|  present|
|             DOX|     DRUG|   absent|
|    trandolapril|     DRUG| possible|
|        losartan|     DRUG|  present|
|          3-week| DURATION|   ab

In [24]:
embeddings = 'embeddings_clinical'

model_name = 'ner_posology_greedy'

entry_data = spark.createDataFrame([["The patient did not take a capsule of Advil."]]).toDF("text")

ner_df = get_clinical_assertion (embeddings, entry_data, nrows, model_name)

ner_df.show()

ner_posology_greedy download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+----------------+---------+---------+
|           chunk|ner_label|assertion|
+----------------+---------+---------+
|capsule of Advil|     DRUG|   absent|
+----------------+---------+---------+



In [25]:
embeddings = 'embeddings_clinical'

model_name = 'ner_clinical'

entry_data = spark.createDataFrame([["The patient has no fever"]]).toDF("text")

ner_df = get_clinical_assertion (embeddings, entry_data, nrows, model_name)

ner_df.show()

ner_clinical download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+-----+---------+---------+
|chunk|ner_label|assertion|
+-----+---------+---------+
|fever|  PROBLEM|   absent|
+-----+---------+---------+



In [26]:
import pandas as pd

def get_clinical_assertion_light (light_model, text):

  light_result = light_model.fullAnnotate(text)[0]

  chunks=[]
  entities=[]
  status=[]

  for n,m in zip(light_result['ner_chunk'],light_result['assertion']):
      
      chunks.append(n.result)
      entities.append(n.metadata['entity']) 
      status.append(m.result)
          
  df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})

  return df

In [27]:
clinical_text = """
Patient with severe fever and sore throat. 
He shows no stomach pain and he maintained on an epidural and PCA for pain control.
He also became short of breath with climbing a flight of stairs.
After CT, lung tumor located at the right lower lobe. Father with Alzheimer.
"""

light_model = LightPipeline(model)

# get_clinical_assertion_light (light_model, clinical_text)

cols = [
     'entities_ner_chunk',
     'entities_ner_chunk_class', 
     'assertion',]
     
df = nlu.to_pretty_df(light_model,clinical_text, output_level='chunk')[cols]
df

| |entities_ner_chunk|entities_ner_chunk_class|assertion|
|-|-|-|-|
|0|severe fever|PROBLEM|present|
|1|sore throat|PROBLEM|present|
|2|stomach pain|PROBLEM|absent|
|3|an epidural|TREATMENT|present|
|4|PCA|TREATMENT|present|
|5|pain control|PROBLEM|present|
|6|short of breath|PROBLEM|conditional|
|7|CT|TEST|present|
|8|lung tumor|PROBLEM|present|
|9|Alzheimer|PROBLEM|associated_with_someone_else|


## Assertion with BioNLP (Cancer Genetics) NER

In [28]:
embeddings = 'embeddings_clinical'

model_name = 'ner_bionlp'

nrows = 100

ner_df = get_clinical_assertion (embeddings, pubMedDF, nrows, model_name)

ner_df.show(truncate = False)

ner_bionlp download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
+----------------------+--------------------+-----------+
|chunk                 |ner_label           |assertion  |
+----------------------+--------------------+-----------+
|human                 |Organism            |present    |
|Kir 3.3               |Gene_or_gene_product|present    |
|GIRK3                 |Gene_or_gene_product|present    |
|potassium             |Simple_chemical     |conditional|
|GIRK                  |Gene_or_gene_product|conditional|
|chromosome 1q21-23    |Cellular_component  |present    |
|pancreas              |Organ               |present    |
|tissues               |Tissue              |possible   |
|fat andskeletal muscle|Tissue              |possible   |
|KCNJ9                 |Gene_or_gene_product|present    |
|Type II              

# Assertion Filterer
`AssertionFilterer` filters named-entity chunks by a list of acceptable assertion statuses. It is handy when you want to whitelist statuses such as `present` or `conditional` and keep chunks with any other status, such as `absent`, from passing further down your pipeline.
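
The whitelisting idea can be sketched in plain Python before wiring up the annotator. This is only an illustration of the filtering logic, not the annotator's implementation; the helper name `filter_by_assertion` is hypothetical, and the chunk/status pairs are the ones the pipeline produces later in this section.

```python
def filter_by_assertion(pairs, white_list):
    """Keep the chunks whose assertion status is in the whitelist."""
    allowed = set(white_list)
    return [chunk for chunk, status in pairs if status in allowed]


# (chunk, assertion) pairs as produced by the pipeline in this section.
pairs = [
    ("a headache", "present"),
    ("a head CT", "present"),
    ("anxious", "present"),
    ("fast", "present"),
    ("Alopecia", "present"),
    ("pain", "absent"),
]

kept = filter_by_assertion(pairs, ["present"])
print(kept)  # ['a headache', 'a head CT', 'anxious', 'fast', 'Alopecia']
```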

In [29]:
clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\
    #.setIncludeAllConfidenceScores(False)

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

clinical_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setWhiteList(["present"])

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")
assertionFilter_model = nlpPipeline.fit(empty_data)

ner_clinical download started this may take some time.
[OK!]
assertion_dl download started this may take some time.
[OK!]


In [30]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain.'

light_model = LightPipeline(assertionFilter_model)
light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['assertion_filtered', 'document', 'ner_chunk', 'assertion', 'token', 'ner', 'embeddings', 'sentence'])

In [31]:
list(zip(light_result['ner_chunk'], light_result['assertion']))

[('a headache', 'present'),
 ('a head CT', 'present'),
 ('anxious', 'present'),
 ('fast', 'present'),
 ('Alopecia', 'present'),
 ('pain', 'absent')]

In [32]:
assertion_filterer.getWhiteList()

['present']

In [33]:
light_result['assertion_filtered']

['a headache', 'a head CT', 'anxious', 'fast', 'Alopecia']

# Train a custom Assertion Model

In [34]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/i2b2_assertion_sample_short.csv

In [35]:
import pandas as pd

In [36]:
assertion_df = spark.read.option("header", True).option("inferSchema", "True").csv("i2b2_assertion_sample_short.csv")

In [37]:
assertion_df.show(3, truncate=100)

+-------------------------------------------------+-------------------+-------+-----+---+
|                                             text|             target|  label|start|end|
+-------------------------------------------------+-------------------+-------+-----+---+
|She has no history of liver disease , hepatitis .|      liver disease| absent|    5|  6|
|                         1. Undesired fertility .|undesired fertility|present|    1|  2|
|                            3) STATUS POST FALL .|               fall|present|    3|  3|
+-------------------------------------------------+-------------------+-------+-----+---+
only showing top 3 rows
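
The `start` and `end` columns appear to be inclusive token indices of the target within the text, not character offsets. The sketch below reproduces them for the three rows shown, assuming whitespace tokenization and case-insensitive matching. `target_token_span` is a hypothetical helper for illustration, not a Spark NLP function.

```python
def target_token_span(text, target):
    """Return inclusive (start, end) token indices of target in text, or None."""
    tokens = text.lower().split()
    target_tokens = target.lower().split()
    n = len(target_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target_tokens:
            return i, i + n - 1
    return None


# The three rows displayed above: (text, target)
rows = [
    ("She has no history of liver disease , hepatitis .", "liver disease"),
    ("1. Undesired fertility .", "undesired fertility"),
    ("3) STATUS POST FALL .", "fall"),
]
spans = [target_token_span(text, target) for text, target in rows]
print(spans)
```

A check like this can be useful when preparing your own training data, since a span that does not line up with the target gives the model the wrong chunk.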



In [38]:
(training_data, test_data) = assertion_df.randomSplit([0.8, 0.2], seed = 100)
print("Training Dataset Count: " + str(training_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 721
Test Dataset Count: 170


In [39]:
training_data.groupBy('label').count().orderBy('count', ascending=False).show(truncate=False)

+-------+-----+
|label  |count|
+-------+-----+
|present|546  |
|absent |175  |
+-------+-----+



In [40]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunk = Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("chunk")\
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(True)

token = Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


Next, we transform the test data with the same preprocessing stages (document, chunk, token, embeddings) that precede `AssertionDLApproach` in the training pipeline.
This gives the test data the same columns that `AssertionDLApproach` receives from the training data. <br/>
The goal is to be able to use the `setTestDataset()` parameter of `AssertionDLApproach`.

In [41]:
clinical_assertion_pipeline = Pipeline(
    stages = [
    document,
    chunk,
    token,
    embeddings])

assertion_test_data = clinical_assertion_pipeline.fit(test_data).transform(test_data)

In [42]:
assertion_test_data.columns

['text',
 'target',
 'label',
 'start',
 'end',
 'document',
 'chunk',
 'token',
 'embeddings']

We save the test data in parquet format to use in `AssertionDLApproach()`. 

In [43]:
# mode("overwrite") lets this cell be re-run without failing on an existing path
assertion_test_data.write.mode("overwrite").parquet('i2b2_assertion_sample_test_data.parquet')

## Graph setup

In [None]:
!pip install -q tensorflow==2.7.0
!pip install -q tensorflow-addons

We will use the `TFGraphBuilder` annotator, which creates the TensorFlow graph as a stage of the model training pipeline.

`TFGraphBuilder` inspects the data and builds a suitable graph, provided that a compatible version of TensorFlow (<= 2.7) is installed. The graph is stored in the configured folder and loaded by the approach.

In [45]:
from sparknlp_jsl.annotator import TFGraphBuilder

In [46]:
graph_folder= "./tf_graphs"

In [47]:
assertion_graph_builder = TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(250)\
    .setHiddenUnitsNumber(25)

In [None]:
'''
# ready to use tf_graph

!mkdir training_logs
!mkdir assertion_tf_graph

!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/tf_graphs/blstm_34_32_30_200_2.pb -P /content/assertion_tf_graph
'''

In [None]:
'''
# create custom graph

from sparknlp_jsl.training import tf_graph
tf_graph.print_model_params("assertion_dl")

feat_size = 200
n_classes = 6

tf_graph.build("assertion_dl",
              build_params={"n_classes": n_classes},
              model_location= "./tf_graphs", 
              model_filename="blstm_34_32_30_{}_{}.pb".format(feat_size, n_classes))
'''

**Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models**


This parameter (set with `setScopeWindow()`) lets you train Assertion Status models to focus on a specific context window when resolving the status of a NER chunk. The window has the format `[X,Y]`, where `X` is the maximum number of tokens to consider to the left of the chunk and `Y` is the maximum number of tokens to consider to the right. Here is what different windows mean:


*   By default, the window is `[-1,-1]`, which means that the model looks at all of the tokens in the sentence/document (up to the maximum number of tokens set in `setMaxSentLen()`).
*   `[0,0]` means “ignore every token except the ner_chunk”, that is, no context is considered when resolving the assertion status.
*   `[9,15]` is what empirically seems to be the best baseline: the model looks at up to 9 tokens to the left and 15 to the right of the NER chunk to understand the context and resolve the status.
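
The window semantics described in the list above can be sketched in plain Python. This illustrates the described behavior only and is not Spark NLP's internal implementation. `scope_tokens` is a hypothetical helper, and `start`/`end` are assumed to be inclusive token indices of the chunk.

```python
def scope_tokens(tokens, start, end, window):
    """Return the tokens that fall inside the scope window around a chunk.

    window is [left, right]; a negative value means "no limit" on that side.
    """
    left, right = window
    first = 0 if left < 0 else max(0, start - left)
    last = len(tokens) if right < 0 else min(len(tokens), end + 1 + right)
    return tokens[first:last]


tokens = "She has no history of liver disease , hepatitis .".split()
start, end = 5, 6  # the chunk "liver disease"

whole = scope_tokens(tokens, start, end, [-1, -1])  # every token
chunk_only = scope_tokens(tokens, start, end, [0, 0])  # chunk, no context
narrow = scope_tokens(tokens, start, end, [3, 1])  # 3 left, 1 right

print(chunk_only)
print(narrow)
```

With `[0,0]` the negation cue `no` is lost, while `[3,1]` still includes it. This is why a window that is too narrow can hurt the `absent` class in particular.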


See the [Scope Window Tuning Assertion Status Detection notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.1.Scope_window_tuning_assertion_status_detection.ipynb), which illustrates the effect of different windows and how to tune the window to get the most out of your AssertionDLModels.

In our case, the best scope window is around `[10,10]`.

In [48]:
scope_window = [10,10]

assertionStatus = AssertionDLApproach()\
    .setLabelCol("label")\
    .setInputCols("document", "chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setDropout(0.1)\
    .setLearningRate(0.001)\
    .setEpochs(15)\
    .setValidationSplit(0.2)\
    .setStartCol("start")\
    .setEndCol("end")\
    .setMaxSentLen(250)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="/content/i2b2_assertion_sample_test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)

# Note: when .setTestDataset() is used, raw test data cannot be passed in directly.
# It only works with a dataframe that has already been transformed by a pipeline
# consisting of the document, chunk, and embeddings stages.

In [None]:
'''
assertionStatus = AssertionLogRegApproach()\
    .setLabelCol("label")\
    .setInputCols("document", "chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setMaxIter(100) # default: 26
'''

In [49]:
clinical_assertion_pipeline = Pipeline(
    stages = [
    document,
    chunk,
    token,
    embeddings,
    assertion_graph_builder,
    assertionStatus])

In [50]:
%%time

assertion_model = clinical_assertion_pipeline.fit(training_data)

Medical Graph Builder configuration:
Model name: assertion_dl
Graph folder: ./tf_graphs
Graph file name: assertion_graph.pb
Build params: {'n_classes': 2, 'feat_size': 200, 'max_seq_len': 250, 'n_hidden': 25}
Instructions for updating:
non-resource variables are not supported in the long term
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0

Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0

assertion_dl graph exported to ./tf_graphs/assertion_graph.p

Checking the results saved in the log file

In [51]:
import os

log_files = os.listdir("/content/training_logs")
log_files

['AssertionDLApproach_d80fc8e0a7c2.log']

In [52]:
with open("/content/training_logs/"+log_files[0]) as log_file:
    print(log_file.read())

Name of the selected graph: ./tf_graphs/assertion_graph.pb
Training started, trainExamples: 721


Epoch: 0 started, learning rate: 0.001, dataset size: 577
Done, 5.88557831 total training loss: 4.570156, avg training loss: 0.9140312, batches: 5
Quality on validation dataset (20.0%), validation examples = 144
time to finish evaluation: 1.09s
Total validation loss: 1.4668	Avg validation loss: 0.7334
label	 tp	 fp	 fn	 prec	 rec	 f1
present	 58	 18	 46	 0.7631579	 0.5576923	 0.64444447
absent	 22	 46	 18	 0.32352942	 0.55	 0.40740743
tp: 80 fp: 64 fn: 64 labels: 2
Macro-average	 prec: 0.54334366, rec: 0.5538461, f1: 0.5485446
Micro-average	 prec: 0.5555556, rec: 0.5555556, f1: 0.5555556


Quality on test dataset: 
time to finish evaluation: 1.11s
Total test loss: 1.3779	Avg test loss: 0.6889
label	 tp	 fp	 fn	 prec	 rec	 f1
present	 63	 27	 54	 0.7	 0.53846157	 0.6086957
absent	 26	 54	 27	 0.325	 0.49056605	 0.39097744
tp: 89 fp: 81 fn: 81 labels: 2
Macro-average	 prec: 0.5125, rec: 0.51
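
The per-label and averaged scores in the log can be reproduced from the `tp`, `fp`, and `fn` counts. The sketch below uses the epoch 0 validation counts shown above. It assumes that the logged macro F1 is the harmonic mean of macro precision and macro recall. That assumption reproduces the logged value here, but it is inferred from the numbers and not taken from the library documentation.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (tp, fp, fn) copied from the epoch 0 validation block of the log
counts = {"present": (58, 18, 46), "absent": (22, 46, 18)}
per_label = {label: prf(*c) for label, c in counts.items()}

# Macro average: unweighted mean over labels
macro_p = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)  # assumed definition

# Micro average: pool the counts over labels first
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
micro_p = total_tp / (total_tp + total_fp)

for label, (p, r, f1) in per_label.items():
    print(f"{label}\t prec: {p:.4f}\t rec: {r:.4f}\t f1: {f1:.4f}")
print(f"macro f1: {macro_f1:.4f}, micro prec: {micro_p:.4f}")
```

Because every example receives exactly one predicted label, the pooled `fp` and `fn` totals are equal, so micro precision, recall, and F1 coincide, as they do in the log.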

In [53]:
preds = assertion_model.transform(test_data).select('label','assertion.result')

preds.show()

+-------+---------+
|  label|   result|
+-------+---------+
|present|[present]|
| absent|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present| [absent]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
|present|[present]|
+-------+---------+
only showing top 20 rows



In [54]:
preds_df = preds.toPandas()

In [55]:
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
preds_df

| |label|result|
|-|-|-|
|0|present|present|
|1|absent|present|
|2|present|present|
|3|present|present|
|4|present|present|
|...|...|...|
|165|present|present|
|166|absent|absent|
|167|absent|absent|
|168|absent|absent|


In [56]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))

              precision    recall  f1-score   support

      absent       0.78      0.55      0.64        53
     present       0.82      0.93      0.87       117

    accuracy                           0.81       170
   macro avg       0.80      0.74      0.76       170
weighted avg       0.81      0.81      0.80       170
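
The `macro avg` and `weighted avg` rows differ because the classes are imbalanced. As a rough check, the sketch below recomputes both F1 averages from the per-class values in the report above. It starts from the rounded F1 scores, so the results agree with the report only to about two decimals.

```python
# (f1-score, support) per class, copied from the report above
report = {"absent": (0.64, 53), "present": (0.87, 117)}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"macro f1: {macro_f1:.3f}, weighted f1: {weighted_f1:.3f}")
```

The weighted average is pulled toward the `present` class, which has more than twice the support of `absent`. For that reason the macro average is the more telling number when the minority class matters.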



In [None]:
# save model
assertion_model.stages[-1].write().overwrite().save('assertion_model')