![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/02.2.FewShot_Assertion_Classifier.ipynb)

# Few-Shot Assertion Classifier

**Few-Shot Assertion Classifier Model for Higher Accuracy with Less Data**

The Few-Shot Assertion Classifier Model is an annotator, inspired by the SetFit framework, designed to reach higher accuracy with fewer training samples. A Few-Shot Assertion model consists of a sentence embedding component paired with a classifier (or head). Current support focuses on MPNet-based Few-Shot Assertion models; future updates will extend compatibility to other popular models such as BERT, DistilBERT, and RoBERTa.

The classifier supports several head types, including sklearn's LogisticRegression and custom PyTorch models, which allows for different model setups. Users must specify the classifier type when exporting the model to Spark NLP.
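
The two-part design (an embedding step feeding a lightweight classifier head) can be illustrated with a deliberately tiny stand-in. This is a sketch under stated assumptions, not the Spark NLP implementation: `toy_embed`, `VOCAB`, and the nearest-centroid head are hypothetical and were chosen only so the example runs without dependencies.

```python
from collections import Counter
import math

# Hypothetical toy vocabulary; a real model uses a transformer encoder instead.
VOCAB = ["no", "denies", "has", "reports", "pain", "fever"]

def toy_embed(sentence):
    """Hypothetical embedder: L2-normalized word counts over VOCAB."""
    counts = Counter(sentence.lower().split())
    vec = [float(counts[w]) for w in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def fit_centroids(examples):
    """Head training: average the embeddings of each label."""
    sums, sizes = {}, Counter()
    for text, label in examples:
        acc = sums.setdefault(label, [0.0] * len(VOCAB))
        for i, v in enumerate(toy_embed(text)):
            acc[i] += v
        sizes[label] += 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, sentence):
    """Head inference: pick the label whose centroid has the highest dot product."""
    emb = toy_embed(sentence)
    return max(centroids, key=lambda lab: sum(a * b for a, b in zip(emb, centroids[lab])))

# A handful of labeled examples is enough to fit this toy head.
few_shot_examples = [
    ("patient has fever", "present"),
    ("patient reports pain", "present"),
    ("patient denies pain", "absent"),
    ("no fever", "absent"),
]
centroids = fit_centroids(few_shot_examples)
prediction = predict(centroids, "she denies fever")
print(prediction)
```

The point of the split is that the embedding step carries most of the knowledge, so the head can be fitted from very few labeled examples.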

# **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [4]:
from johnsnowlabs import nlp, medical

# Automatically load license data and start a session with all jars user has access to
spark_conf = {
    "spark.driver.memory": "48g",
    "spark.driver.cores": "16",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.driver.maxResultSize": "20000M",
    "spark.dynamicAllocation.enabled":"false",
    "spark.files.overwrite": "true",
    "spark.extraListeners": "com.johnsnowlabs.license.LicenseLifeCycleManager"
}

spark = nlp.start(
    spark_conf = spark_conf
)
spark

👌 Detected license file /content/5.5.0.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with: 🚀Spark-NLP==5.5.0, 💊Spark-Healthcare==5.5.0, running on ⚡ PySpark==3.4.0


In [5]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# Pretrained Models

|FewShot Assertion Model Name| Predicted Classes | Trained Embeddings |
|----------------------------|-------------------|--------------------|
|[fewhot_assertion_jsl_e5_base_v2_jsl](https://nlp.johnsnowlabs.com/2024/07/03/fewhot_assertion_jsl_e5_base_v2_jsl_en.html) | `Present`, `Absent`, `Possible`, `Planned`, `Past`, `Family`, `Hypotetical`, `SomeoneElse`  | [e5_base_v2_embeddings_medical_assertion_jsl](https://nlp.johnsnowlabs.com/2024/07/03/e5_base_v2_embeddings_medical_assertion_jsl_en.html)  |
|[fewhot_assertion_i2b2_e5_base_v2_i2b2](https://nlp.johnsnowlabs.com/2024/07/03/fewhot_assertion_i2b2_e5_base_v2_i2b2_en.html) | `absent`, `associated_with_someone_else`, `conditional`, `hypothetical`, `possible`, `present`   | [e5_base_v2_embeddings_medical_assertion_i2b2](https://nlp.johnsnowlabs.com/2024/07/03/e5_base_v2_embeddings_medical_assertion_i2b2_en.html)  |
|[fewhot_assertion_sdoh_e5_base_v2_sdoh](https://nlp.johnsnowlabs.com/2024/07/04/fewhot_assertion_sdoh_e5_base_v2_sdoh_en.html)|  `Absent`, `Past`, `Present`, `Someone_Else`, `Hypothetical`, `Possible`  | [e5_base_v2_embeddings_medical_assertion_sdoh](https://nlp.johnsnowlabs.com/2024/07/04/e5_base_v2_embeddings_medical_assertion_sdoh_en.html) |
|[fewhot_assertion_smoking_e5_base_v2_smoking](https://nlp.johnsnowlabs.com/2024/07/03/fewhot_assertion_smoking_e5_base_v2_smoking_en.html) | `Present`, `Absent`, `Past`  | [e5_base_v2_embeddings_medical_assertion_smoking](https://nlp.johnsnowlabs.com/2024/07/03/e5_base_v2_embeddings_medical_assertion_smoking_en.html)  |
|[fewhot_assertion_oncology_e5_base_v2_oncology](https://nlp.johnsnowlabs.com/2024/07/03/fewhot_assertion_oncology_e5_base_v2_oncology_en.html) | `Absent`, `Past`, `Present`, `Family`, `Hypothetical`, `Possible`  | [e5_base_v2_embeddings_medical_assertion_oncology](https://nlp.johnsnowlabs.com/2024/07/03/e5_base_v2_embeddings_medical_assertion_oncology_en.html)   |
|[fewhot_assertion_radiology_e5_base_v2_radiology](https://nlp.johnsnowlabs.com/2024/07/03/fewhot_assertion_radiology_e5_base_v2_radiology_en.html) | `Confirmed`, `Negative`, `Suspected`| [e5_base_v2_embeddings_medical_assertion_radiology](https://nlp.johnsnowlabs.com/2024/07/03/e5_base_v2_embeddings_medical_assertion_radiology_en.html)  |
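
Each classifier in the table is listed with a specific embeddings model, so the two names need to be chosen together. A minimal sketch of a lookup that keeps the pairing in one place follows; the dictionary values are copied from the table, while `FEWSHOT_TO_EMBEDDINGS` and `embeddings_for` are hypothetical helper names.

```python
# Pairings copied from the table above: classifier name -> embeddings name.
FEWSHOT_TO_EMBEDDINGS = {
    "fewhot_assertion_jsl_e5_base_v2_jsl": "e5_base_v2_embeddings_medical_assertion_jsl",
    "fewhot_assertion_i2b2_e5_base_v2_i2b2": "e5_base_v2_embeddings_medical_assertion_i2b2",
    "fewhot_assertion_sdoh_e5_base_v2_sdoh": "e5_base_v2_embeddings_medical_assertion_sdoh",
    "fewhot_assertion_smoking_e5_base_v2_smoking": "e5_base_v2_embeddings_medical_assertion_smoking",
    "fewhot_assertion_oncology_e5_base_v2_oncology": "e5_base_v2_embeddings_medical_assertion_oncology",
    "fewhot_assertion_radiology_e5_base_v2_radiology": "e5_base_v2_embeddings_medical_assertion_radiology",
}

def embeddings_for(classifier_name):
    """Return the embeddings model listed for a classifier, or raise a clear error."""
    try:
        return FEWSHOT_TO_EMBEDDINGS[classifier_name]
    except KeyError:
        raise ValueError(f"Unknown few-shot assertion model: {classifier_name}") from None

oncology_embeddings = embeddings_for("fewhot_assertion_oncology_e5_base_v2_oncology")
print(oncology_embeddings)
```

The returned name is what would be passed to `nlp.E5Embeddings.pretrained(...)` in the pipelines that follow.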

## Oncology

In [6]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setSplitChars(["-", "\/"])

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

# ner_oncology
ner_oncology = medical.NerModel.pretrained("ner_oncology","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_oncology")

ner_oncology_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_oncology"])\
    .setOutputCol("ner_chunk")

few_shot_assertion_converter = medical.FewShotAssertionSentenceConverter()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("assertion_sentence")

e5_embeddings = nlp.E5Embeddings.pretrained("e5_base_v2_embeddings_medical_assertion_oncology", "en", "clinical/models")\
    .setInputCols(["assertion_sentence"])\
    .setOutputCol("assertion_embedding")

few_shot_assertion_classifier = medical.FewShotAssertionClassifierModel\
    .pretrained("fewhot_assertion_oncology_e5_base_v2_oncology", "en", "clinical/models")\
    .setInputCols(["assertion_embedding"])\
    .setOutputCol("assertion_fewshot")

assertion_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_oncology,
        ner_oncology_converter,
        few_shot_assertion_converter,
        e5_embeddings,
        few_shot_assertion_classifier
])

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_oncology download started this may take some time.
[OK!]
e5_base_v2_embeddings_medical_assertion_oncology download started this may take some time.
Approximate size to download 375.4 MB
[OK!]
fewhot_assertion_oncology_e5_base_v2_oncology download started this may take some time.
[OK!]


In [7]:
sample_text= [
"""The patient is suspected to have colorectal cancer. Her family history is positive for other cancers. The result of the biopsy was positive. A CT scan was ordered to rule out metastases."""
]

data = spark.createDataFrame([sample_text]).toDF("text")

result = assertion_pipeline.fit(data).transform(data)

In [8]:
result.select("assertion_fewshot").show(1, truncate=False)


In [9]:
result.select(F.explode(F.arrays_zip(result.assertion_fewshot.metadata,
                                     result.assertion_fewshot.begin,
                                     result.assertion_fewshot.end,
                                     result.assertion_fewshot.result,)).alias("cols")) \
      .select(F.expr("cols['0']['ner_chunk']").alias("ner_chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['0']['ner_label']").alias("ner_label"),
              F.expr("cols['3']").alias("assertion"),
              F.expr("cols['0']['confidence']").alias("confidence") ).show(truncate=False)

+-----------------+-----+---+----------------+---------+----------+
|ner_chunk        |begin|end|ner_label       |assertion|confidence|
+-----------------+-----+---+----------------+---------+----------+
|colorectal cancer|33   |49 |Cancer_Dx       |Possible |0.5812815 |
|Her              |52   |54 |Gender          |Present  |0.9562998 |
|cancers          |93   |99 |Cancer_Dx       |Family   |0.2346564 |
|biopsy           |120  |125|Pathology_Test  |Past     |0.95732147|
|positive         |131  |138|Pathology_Result|Present  |0.9564386 |
|CT scan          |143  |149|Imaging_Test    |Past     |0.9571699 |
|metastases       |175  |184|Metastasis      |Possible |0.54986554|
+-----------------+-----+---+----------------+---------+----------+
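
The confidence column varies widely in the output above (from about 0.23 to 0.96), so low-confidence assertions may deserve a manual check. The sketch below filters the rows shown in the table in plain Python; `REVIEW_THRESHOLD` is a hypothetical cut-off, not a value recommended by the library. In the Spark DataFrame, annotation metadata values are strings, so the confidence would need a cast to float first.

```python
# Rows copied from the output table above: (chunk, ner_label, assertion, confidence).
rows = [
    ("colorectal cancer", "Cancer_Dx", "Possible", 0.5812815),
    ("Her", "Gender", "Present", 0.9562998),
    ("cancers", "Cancer_Dx", "Family", 0.2346564),
    ("biopsy", "Pathology_Test", "Past", 0.95732147),
    ("positive", "Pathology_Result", "Present", 0.9564386),
    ("CT scan", "Imaging_Test", "Past", 0.9571699),
    ("metastases", "Metastasis", "Possible", 0.54986554),
]

REVIEW_THRESHOLD = 0.6  # hypothetical cut-off for manual review

# Keep only the chunks whose assertion confidence falls below the threshold.
needs_review = [chunk for chunk, _label, _assertion, conf in rows if conf < REVIEW_THRESHOLD]
print(needs_review)
```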



**Display the result of the FewShotAssertionClassifierModel using `sparknlp_display`.**

In [10]:
from google.colab import widgets
from sparknlp_display import AssertionVisualizer

assertion_visualiser = AssertionVisualizer()
results = result.collect()

In [11]:
for row in results:
    assertion_visualiser.display(row, label_col='ner_chunk', assertion_col='assertion_fewshot')

# Train a custom Few-Shot Assertion Model

In [12]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/i2b2_assertion_sample_short.csv

In [13]:
import pandas as pd

assertion_df = spark.read.option("header", True).option("inferSchema", "True").csv("i2b2_assertion_sample_short.csv")

assertion_df.show(3, truncate=100)

+-------------------------------------------------+-------------------+-------+-----+---+
|                                             text|             target|  label|start|end|
+-------------------------------------------------+-------------------+-------+-----+---+
|She has no history of liver disease , hepatitis .|      liver disease| absent|    5|  6|
|                         1. Undesired fertility .|undesired fertility|present|    1|  2|
|                            3) STATUS POST FALL .|               fall|present|    3|  3|
+-------------------------------------------------+-------------------+-------+-----+---+
only showing top 3 rows



In [14]:
(training_data, test_data) = assertion_df.randomSplit([0.8, 0.2], seed = 100)
print("Training Dataset Count: " + str(training_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 721
Test Dataset Count: 170


In [15]:
training_data.groupBy('label').count().orderBy('count', ascending=False).show(truncate=False)

test_data.groupBy('label').count().orderBy('count', ascending=False).show(truncate=False)

+-------+-----+
|label  |count|
+-------+-----+
|present|546  |
|absent |175  |
+-------+-----+

+-------+-----+
|label  |count|
+-------+-----+
|present|117  |
|absent |53   |
+-------+-----+
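
The counts above show two things worth quantifying before training: `randomSplit` gives only an approximate 80/20 split, and the classes are imbalanced. A short worked calculation, using only the numbers printed above, gives the actual training share and the accuracy of always predicting the majority class, which is a useful floor for judging the trained model.

```python
# Label counts copied from the two tables above.
train_counts = {"present": 546, "absent": 175}
test_counts = {"present": 117, "absent": 53}

def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())

train_share = n_train / (n_train + n_test)
baseline = majority_baseline(test_counts)

print(f"train share: {train_share:.3f}")
print(f"majority baseline on test: {baseline:.3f}")
```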



In [16]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

chunk2doc = medical.Doc2ChunkInternal()\
    .setInputCols(["document","token"])\
    .setOutputCol("ner_chunk")\
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(True)

few_shot_assertion_sentence_converter = medical.FewShotAssertionSentenceConverter()\
    .setInputCols(["document", "token","ner_chunk"])\
    .setOutputCol("assertion_sentence")

e5_embeddings = nlp.E5Embeddings.pretrained("e5_base_v2_embeddings_medical_assertion_base", "en", "clinical/models")\
    .setInputCols(["assertion_sentence"])\
    .setOutputCol("assertion_embedding")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document,
        token,
        chunk2doc,
        few_shot_assertion_sentence_converter,
        e5_embeddings
])

e5_base_v2_embeddings_medical_assertion_base download started this may take some time.
Approximate size to download 374.8 MB
[OK!]


In [17]:
assertion_test_data = embeddings_pipeline.fit(test_data).transform(test_data)
#assertion_test_data.write.mode('overwrite').parquet('i2b2_assertion_sample_test_data.parquet')

assertion_train_data = embeddings_pipeline.fit(training_data).transform(training_data)
#assertion_train_data.write.mode('overwrite').parquet('i2b2_assertion_sample_train_data.parquet')

## Graph setup

In [None]:
!pip install -q tensorflow==2.12.0
!pip install -q tensorflow-addons

We will use the TFGraphBuilder annotator to create the TensorFlow graph as part of the model training pipeline.


In [19]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "./tf_graphs"
graph_name = "assertion_graph.pb"

assertion_graph_builder = medical.TFGraphBuilder()\
    .setModelName("fewshot_assertion")\
    .setInputCols(["assertion_embedding"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile(graph_name)\
    .setHiddenUnitsNumber(100)

fewshot_assertion_approach = medical.FewShotAssertionClassifierApproach()\
    .setInputCols("assertion_embedding")\
    .setOutputCol("assertion")\
    .setLabelCol("label")\
    .setBatchSize(32)\
    .setDropout(0.1)\
    .setLearningRate(0.001)\
    .setEpochsNumber(40)\
    .setValidationSplit(0.2)\
    .setModelFile(f"{graph_folder}/{graph_name}")

clinical_assertion_pipeline = nlp.Pipeline(
    stages = [
        assertion_graph_builder,
        fewshot_assertion_approach
])

In [20]:
%%time

assertion_model = clinical_assertion_pipeline.fit(assertion_train_data)

TF Graph Builder configuration:
Model name: fewshot_assertion
Graph folder: ./tf_graphs
Graph file name: assertion_graph.pb
Build params: {'input_dim': 768, 'output_dim': 2}
fewshot_assertion graph exported to ./tf_graphs/assertion_graph.pb
CPU times: user 3.39 s, sys: 7.48 s, total: 10.9 s
Wall time: 1min 45s
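
The `Build params` line reports an input dimension of 768 and an output dimension of 2. To get a feel for how small the classifier head is, the sketch below counts the parameters of a fully connected network. It assumes that `setHiddenUnitsNumber(100)` yields a single hidden layer of 100 units; that is an assumption for illustration, and the graph actually built by TFGraphBuilder may differ.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a fully connected stack with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

input_dim, output_dim = 768, 2  # from the "Build params" line above
hidden_units = 100              # value passed to setHiddenUnitsNumber

# ASSUMPTION: exactly one hidden layer of `hidden_units` units.
with_hidden = dense_param_count([input_dim, hidden_units, output_dim])

# For comparison: a purely linear head with no hidden layer.
linear_only = dense_param_count([input_dim, output_dim])

print(with_hidden, linear_only)
```

Either way the head is orders of magnitude smaller than the embedding model, which is consistent with the short training time reported above.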


## Checking the results

Check the model's predictions on the held-out test set.

In [21]:
preds = assertion_model.transform(assertion_test_data)\
                       .selectExpr('label','assertion.result[0] as result')

preds_df = preds.toPandas()
preds_df

| |label|result|
|---|-----|------|
|0|present|present|
|1|absent|absent|
|2|present|present|
|3|present|present|
|4|present|present|
|...|...|...|
|165|present|present|
|166|absent|absent|
|167|absent|absent|
|168|absent|absent|


In [22]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print (classification_report( preds_df['label'], preds_df['result']))

              precision    recall  f1-score   support

      absent       0.96      0.92      0.94        53
     present       0.97      0.98      0.97       117

    accuracy                           0.96       170
   macro avg       0.96      0.95      0.96       170
weighted avg       0.96      0.96      0.96       170
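
For readers who want to see what the per-class columns of the report mean, here is a minimal sketch of precision, recall, and F1 computed by hand. The labels are a hypothetical toy example, not the notebook's test set, and `precision_recall_f1` is an illustrative helper rather than part of sklearn.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy labels for illustration only.
y_true = ["absent", "absent", "present", "present", "present"]
y_pred = ["absent", "present", "present", "present", "absent"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="absent")
print(precision, recall, f1)
```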



In [23]:
# save model
assertion_model.stages[-1].write().overwrite().save('custom_fewshot_assertion_model')

## Load saved model

**Build Pipeline**

In [24]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [25]:
few_shot_assertion_sentence_converter = medical.FewShotAssertionSentenceConverter()\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("assertion_sentence")

e5_embeddings = nlp.E5Embeddings.pretrained("e5_base_v2_embeddings_medical_assertion_base", "en", "clinical/models")\
    .setInputCols(["assertion_sentence"])\
    .setOutputCol("assertion_embedding")

few_shot_assertion_classifier = medical.FewShotAssertionClassifierModel.load("custom_fewshot_assertion_model")\
    .setInputCols(["assertion_embedding"])\
    .setOutputCol("assertion")


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        few_shot_assertion_sentence_converter,
        e5_embeddings,
        few_shot_assertion_classifier
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

e5_base_v2_embeddings_medical_assertion_base download started this may take some time.
Approximate size to download 374.8 MB
[OK!]


In [26]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted'

light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

print(text)

chunks=[]
entities=[]
status=[]
confidence=[]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):

    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
    confidence.append(m.metadata['confidence'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted


| |chunks|entities|assertion|confidence|
|---|------|--------|---------|----------|
|0|a headache|PROBLEM|present|0.99713194|
|1|a head CT|TEST|present|0.9865289|
|2|anxious|PROBLEM|present|0.9947488|
|3|alopecia|PROBLEM|absent|0.97270274|
|4|pain|PROBLEM|absent|0.9721319|
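
The loop in the cell above pairs each NER chunk with its assertion by position, which relies on the pipeline emitting one assertion per chunk in the same order. Once paired, the results are often easier to use when grouped by status. A minimal sketch using the rows from the output above follows; `annotations` is a hand-copied stand-in for the `light_result` annotations.

```python
from collections import defaultdict

# (chunk, entity, assertion) triples copied from the output above.
annotations = [
    ("a headache", "PROBLEM", "present"),
    ("a head CT", "TEST", "present"),
    ("anxious", "PROBLEM", "present"),
    ("alopecia", "PROBLEM", "absent"),
    ("pain", "PROBLEM", "absent"),
]

# Group chunk texts under their assertion status, preserving document order.
by_status = defaultdict(list)
for chunk, _entity, status in annotations:
    by_status[status].append(chunk)

print(dict(by_status))
```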


In [27]:
from sparknlp_display import AssertionVisualizer

assertion_visualiser = AssertionVisualizer()
assertion_visualiser.display(light_result, label_col ='ner_chunk', assertion_col='assertion')

# Train with External Embeddings

In [None]:
import pyspark.sql.types as T
import pyspark.sql.functions as F
import numpy as np
import torch

! pip install -q -U sentence-transformers
! pip install tensorflow_addons

class Annotation:
    def __init__(self, annotator_type, begin, end, result, metadata, embeddings):
        self.annotatorType = annotator_type
        self.begin = begin
        self.end = end
        self.result = result
        self.metadata = metadata
        self.embeddings = embeddings

    @staticmethod
    def getArrayType():
        return T.ArrayType(T.StructType([
            T.StructField('annotatorType', T.StringType(), False),
            T.StructField('begin', T.IntegerType(), False),
            T.StructField('end', T.IntegerType(), False),
            T.StructField('result', T.StringType(), False),
            T.StructField('metadata', T.MapType(T.StringType(), T.StringType()), False),
            T.StructField('embeddings', T.ArrayType(T.FloatType()), False)
        ]))

@F.udf(Annotation.getArrayType())
def add_embedding(row_id, raw_embedding):
    return [Annotation(
        annotator_type="sentence_embeddings",
        begin=0,
        end=0,
        result='chunk',
        metadata={'sentence': '0', 'isWordStart': 'true', 'pieceId': '-1', 'token': 'chunk'},
        embeddings=raw_embedding)]


def return_spark_df_with_sent_embeddings(df, sent_bert_model, sent_bert_model_name, batch_size=256):

  sentences = ["This is an example sentence"]

  embeddings_ = sent_bert_model.encode(sentences)

  dimension = embeddings_.shape[1]

  df['span'] = df['target'] +": "+df['text'].str.strip()

  embeddings = sent_bert_model.encode(df['span'].values, normalize_embeddings=True,
                          batch_size = batch_size, show_progress_bar= True )

  embeddings_array = np.array(embeddings)

  torch.cuda.empty_cache()

  spark_df = spark.createDataFrame([(i,  df['span'].values[i],list(map(float, embeddings_array[i])), df['label'].values[i]) for i in range(embeddings_array.shape[0])],['id', 'span','sent_embedding','label'])

  col_metadata = {
    "annotatorType": "sentence_embeddings",
    "sentence": 0,
    "dimension": dimension,
    'ref':sent_bert_model_name}

  spark_df = spark_df.withColumn("assertion_embedding", add_embedding(spark_df.id, spark_df.sent_embedding).alias("", metadata=col_metadata))

  return spark_df
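
Two details of the helper above are easy to miss: each input is built as `target: text` before encoding, and `normalize_embeddings=True` scales every embedding to unit length. The sketch below reproduces both steps in plain Python on tiny inputs; `build_span` and `l2_normalize` are hypothetical names used only for illustration.

```python
import math

def build_span(target, text):
    """Mirror the helper's span format: '<target>: <stripped text>'."""
    return f"{target}: {text.strip()}"

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

span = build_span("liver disease", "She has no history of liver disease , hepatitis . ")
unit = l2_normalize([3.0, 4.0])
unit_length = math.sqrt(sum(v * v for v in unit))

print(span)
print(unit, unit_length)
```

With unit-length vectors, the dot product between two embeddings equals their cosine similarity.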

In [29]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-base-v2', device='cpu')


In [30]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/i2b2_assertion_sample_short.csv

In [31]:
import pandas as pd

df=pd.read_csv("/content/i2b2_assertion_sample_short.csv")

df.head(3)

| |text|target|label|start|end|
|---|----|------|-----|-----|---|
|0|She has no history of liver disease , hepatitis .|liver disease|absent|5|6|
|1|1. Undesired fertility .|undesired fertility|present|1|2|
|2|3) STATUS POST FALL .|fall|present|3|3|


In [32]:
from sklearn.model_selection import train_test_split
training_data, test_data = train_test_split(df, test_size=0.2, random_state=100)

In [33]:
training_data.groupby('label').size()

|label|0|
|-----|---|
|absent|185|
|present|527|


In [34]:
test_data.groupby('label').size()

|label|0|
|-----|---|
|absent|43|
|present|136|


In [35]:
training_data_df = return_spark_df_with_sent_embeddings(training_data, model, 'e5_base_v2')
test_data_df = return_spark_df_with_sent_embeddings(test_data, model, 'e5_base_v2')


## Graph setup

We will use the TFGraphBuilder annotator to create the TensorFlow graph as part of the model training pipeline.


In [36]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_builder = medical.TFGraphBuilder()\
    .setModelName("fewshot_assertion")\
    .setInputCols(["assertion_embedding"]) \
    .setLabelColumn("label")\
    .setGraphFile("fewshot_assertion.pb")\
    .setGraphFolder("/tmp/assertion-graph")\
    .setHiddenLayers([])

fewshot_assertion_approach = (
    medical.FewShotAssertionClassifierApproach()
        .setInputCols("assertion_embedding")
        .setOutputCol("prediction")
        .setLabelColumn("label")
        .setModelFile("/tmp/assertion-graph/fewshot_assertion.pb")
        .setDropout(0.1)
        .setEpochsNumber(40)
        .setBatchSize(32)
        .setLearningRate(0.001))

assertion_pipeline = nlp.Pipeline(
    stages=[
        graph_builder,
        fewshot_assertion_approach])

In [37]:
%%time

assertion_model = assertion_pipeline.fit(training_data_df)

TF Graph Builder configuration:
Model name: fewshot_assertion
Graph folder: /tmp/assertion-graph
Graph file name: fewshot_assertion.pb
Build params: {'input_dim': 768, 'output_dim': 2, 'hidden_layers': []}
fewshot_assertion graph exported to /tmp/assertion-graph/fewshot_assertion.pb
CPU times: user 462 ms, sys: 52.5 ms, total: 514 ms
Wall time: 5.97 s


## Checking the results

Check the model's predictions on the held-out test set.

In [38]:
results = assertion_model.transform(test_data_df).cache()
pred_df = (results
        .selectExpr("explode(prediction) as prediction", "label")
        .selectExpr("prediction.result as prediction", "label")).toPandas()

pred_df

| |prediction|label|
|---|----------|-----|
|0|present|present|
|1|present|present|
|2|present|present|
|3|absent|absent|
|4|present|present|
|...|...|...|
|174|present|present|
|175|present|present|
|176|present|present|
|177|present|present|


In [39]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print (classification_report(pred_df.label, pred_df.prediction))

              precision    recall  f1-score   support

      absent       0.89      0.58      0.70        43
     present       0.88      0.98      0.93       136

    accuracy                           0.88       179
   macro avg       0.89      0.78      0.82       179
weighted avg       0.88      0.88      0.87       179
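
The gap between the `macro avg` and `weighted avg` rows comes from class imbalance: the macro average treats both classes equally, while the weighted average scales each class by its support. The worked example below recomputes both F1 averages from the rounded per-class values printed above, so the results can differ slightly from the report, which uses unrounded scores.

```python
# Per-class F1 and support copied from the report above (rounded values).
f1 = {"absent": 0.70, "present": 0.93}
support = {"absent": 43, "present": 136}

# Macro average: plain mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes weighted by the number of true instances.
weighted_f1 = sum(f1[label] * support[label] for label in f1) / sum(support.values())

print(f"macro F1: {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

Because `absent` is the smaller class and has the lower F1, the macro average is pulled down further than the weighted one.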



In [40]:
# save model
assertion_model.stages[-1].write().overwrite().save('custom_fewshot_assertion_model')

## Load saved model

### **Export Embeddings to Spark NLP**

In [None]:
!pip install -q --upgrade transformers[onnx]==4.29.1 optimum tensorflow

In [42]:
from optimum.onnxruntime import ORTModelForFeatureExtraction

MODEL_NAME = "intfloat/e5-base-v2"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

ort_model = ORTModelForFeatureExtraction.from_pretrained(MODEL_NAME, export=True)

# Save the ONNX model
ort_model.save_pretrained(EXPORT_PATH)

# Create directory for assets and move the tokenizer files.
# A separate folder is needed for Spark NLP.
!mkdir {EXPORT_PATH}/assets
!mv {EXPORT_PATH}/vocab.txt {EXPORT_PATH}/assets/
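
The two shell commands above only work where `mkdir` and `mv` are available. A portable alternative using the Python standard library is sketched below; `move_vocab_to_assets` is a hypothetical helper, and the demonstration runs against a temporary directory rather than the real export path.

```python
from pathlib import Path
import shutil
import tempfile

def move_vocab_to_assets(export_path):
    """Create <export_path>/assets and move vocab.txt into it; return the new path."""
    export_dir = Path(export_path)
    assets_dir = export_dir / "assets"
    assets_dir.mkdir(parents=True, exist_ok=True)

    source = export_dir / "vocab.txt"
    if not source.exists():
        raise FileNotFoundError(f"Expected tokenizer vocabulary at {source}")

    destination = assets_dir / "vocab.txt"
    shutil.move(str(source), str(destination))
    return destination

# Demonstration on a throwaway directory with a placeholder vocabulary file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "vocab.txt").write_text("[PAD]\n[UNK]\n")
    moved = move_vocab_to_assets(tmp)
    moved_ok = moved.exists() and not (Path(tmp) / "vocab.txt").exists()

print(moved_ok)
```

In the notebook, the call would be `move_vocab_to_assets(EXPORT_PATH)`; unlike plain `mkdir`, it does not fail if the `assets` folder already exists.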

In [43]:
!ls -l {EXPORT_PATH}

total 426320
drwxr-xr-x 2 root root      4096 Nov 13 14:18 assets
-rw-r--r-- 1 root root       660 Nov 13 14:18 config.json
-rw-r--r-- 1 root root 435820911 Nov 13 14:18 model.onnx
-rw-r--r-- 1 root root       695 Nov 13 14:18 special_tokens_map.json
-rw-r--r-- 1 root root      1190 Nov 13 14:18 tokenizer_config.json
-rw-r--r-- 1 root root    711396 Nov 13 14:18 tokenizer.json


In [44]:
!ls -l {EXPORT_PATH}/assets

total 228
-rw-r--r-- 1 root root 231508 Nov 13 14:18 vocab.txt


In [45]:
MODEL_NAME = "intfloat/e5-base-v2"
EXPORT_PATH = f"onnx_models/{MODEL_NAME}"

# All these params should be identical to the original ONNX model
E5 = nlp.E5Embeddings.loadSavedModel(f"{EXPORT_PATH}", spark)\
    .setInputCols(["document"])\
    .setOutputCol("E5")

In [46]:
E5.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")

In [47]:
! ls -l {MODEL_NAME}_spark_nlp

total 425680
-rw-r--r-- 1 root root 435887550 Nov 13 14:18 e5_onnx
drwxr-xr-x 3 root root      4096 Nov 13 14:18 fields
drwxr-xr-x 2 root root      4096 Nov 13 14:18 metadata


### **Build Pipeline**

In [48]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [49]:
few_shot_assertion_sentence_converter = medical.FewShotAssertionSentenceConverter()\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("assertion_sentence")

e5_embeddings = nlp.E5Embeddings.load(f"{MODEL_NAME}_spark_nlp")\
    .setInputCols(["assertion_sentence"])\
    .setOutputCol("assertion_embedding")

few_shot_assertion_classifier = medical.FewShotAssertionClassifierModel.load("custom_fewshot_assertion_model")\
    .setInputCols(["assertion_embedding"])\
    .setOutputCol("assertion")


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        few_shot_assertion_sentence_converter,
        e5_embeddings,
        few_shot_assertion_classifier
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [50]:
text = 'Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted'

light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)[0]

print(text)

chunks=[]
entities=[]
status=[]
confidence=[]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):

    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    status.append(m.result)
    confidence.append(m.metadata['confidence'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status, 'confidence':confidence})

df

Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted


| |chunks|entities|assertion|confidence|
|---|------|--------|---------|----------|
|0|a headache|PROBLEM|present|0.8825919|
|1|a head CT|TEST|present|0.87244517|
|2|anxious|PROBLEM|present|0.8725356|
|3|alopecia|PROBLEM|absent|0.8105469|
|4|pain|PROBLEM|absent|0.83024746|


In [51]:
from sparknlp_display import AssertionVisualizer

assertion_visualiser = AssertionVisualizer()
assertion_visualiser.display(light_result, label_col ='ner_chunk', assertion_col='assertion')